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Abstract Mixture modelling involves explaining some observed evidence using a combination of probability distributions. 
The crux of the problem is the inference of an optimal number of mixture components and their corresponding parameters. 
This paper discusses unsupervised learning of mixture models using the Bayesian Minimum Message Length (MML) crite¬ 
rion. To demonstrate the effectiveness of search and inference of mixture parameters using the proposed approach, we select 
two key probability distributions, each handling fundamentally different types of data: the multivariate Gaussian distribution 
to address mixture modelling of data distributed in Euclidean space, and the multivariate von Mises-Fisher (vMF) distribu¬ 
tion to address mixture modelling of directional data distributed on a unit hypersphere. The key contributions of this paper, in 
addition to the general search and inference methodology, include the derivation of MML expressions for encoding the data 
using multivariate Gaussian and von Mises-Fisher distributions, and the analytical derivation of the MML estimates of the 
parameters of the two distributions. Our approach is tested on simulated and real world data sets. For instance, we infer vMF 
mixtures that concisely explain experimentally determined three-dimensional protein conformations, providing an effective 
null model description of protein structures that is central to many inference problems in structural bioinformatics. The ex¬ 
perimental results demonstrate that the performance of our proposed search and inference method along with the encoding 
schemes improve on the state of the art mixture modelling techniques. 

Keywords mixture modelling • minimum message length • multivariate Gaussian • von Mises-Fisher • directional data 


1 Introduction 

Mixture models are common tools in statistical pattern recognition (McLachlan and Basford, 1988). They offer a mathe¬ 
matical basis to explain data in fields as diverse as astronomy, biology, ecology, engineering, and economics, amongst many 
others (McLachlan and Peel, 2000). A mixture model is composed of component probabilistic models; a component may 
variously correspond to a subtype, kind, species, or subpopulation of the observed data. These models aid in the identifi¬ 
cation of hidden patterns in the data through sound probabilistic formalisms. Mixture models have been extensively used 
in machine learning tasks such as classification and unsupervised learning (Titterington et al., 1985; McLachlan and Peel, 
2000; Jain et al., 2000). 

Formally, mixture modelling involves representing a distribution of data as a weighted sum of individual probability 
distributions. Specifically, the problem we consider here is to model the observed data using a mixture ^ of probability 
distributions of the form: 

M 

Pr(x;^) = Y, (1) 

>=i 

where x is a J-dimensional datum, M is the number of mixture components, wj and fj{x;0j) are the weight and probability 
density of the component respectively; the weights are positive and sum to one. 

The problem of modelling some observed data using a mixture distribution involves determining the number of compo¬ 
nents, M, and estimating the mixture parameters. Inferring the optimal number of mixture components involves the difficult 
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problem of balancing the trade-off between two conflicting objectives: low hypothesis complexity as determined by the num¬ 
ber of components and their respective parameters, versus good quality of jit to the observed data. Generally a hypothesis 
with more free parameters can fit observed data better than a hypothesis with fewer free parameters. A number of strategies 
have been used to control this balance as discussed in Section 7. These methods provide varied formulations to assess the 
mixture components and their ability to explain the data. Methods using the minimum message length criterion (Wallace and 
Boulton, 1968), a Bayesian method of inductive inference, have been proved to be effective in achieving a reliable balance 
between these conflicting aims (Wallace and Boulton, 1968; Oliver et ah, 1996; Roberts et ah, 1998; Figueiredo and Jain, 
2002 ). 

Although much of the literature concerns the theory and application of Gaussian mixtures (McLachlan and Peel, 2000; 
Jain and Dubes, 1988), mixture modelling using other probability distributions has been widely used. Some examples are: 
Poisson models (Wang et ah, 1996; Wedel et ah, 1993), exponential mixtures (Seidel et ah, 2000), Laplace (Jones and 
McLachlan, 1990), t-distribution (Peel and McLachlan, 2000), Weibull (Patra and Dey, 1999), Kent (Peel et ah, 2001), von 
Mises-Fisher (Banerjee et ah, 2005), and many more. McLachlan and Peel (2000) provide a comprehensive summary of finite 
mixture models and their variations. The use of Gaussian mixtures in several research disciplines has been partly motivated 
by its computational tractability (McLachlan and Peel, 2000). For datasets where the direction of the constituent vectors is 
important, Gaussian mixtures are inappropriate and distributions such as the von Mises-Fisher may be used (Banerjee et ah, 
2005; Mardia et ah, 1979). In any case, whatever the kind of distribution used for an individual component, one needs to 
estimate the parameters of the mixture, and provide a sound justification for selecting the appropriate number of mixture 
components. 

Software for mixture modelling relies on the following elements: 

1. An estimator of the parameters of each component of a mixture, 

2. An objective function, that is a cost or score, that can be used to compare two hypothetical mixtures and decide which is 

better, and 

3. A search strategy for the best number of components and their weights. 

Traditionally, parameter estimation is done by strategies such as maximum likelihood (ML) or Bayesian maximum a poste¬ 
riori probability (MAP) estimation. In this work, we use the Bayesian minimum message length (MML) principle. Unlike 
MAP, MML estimators are invariant under non-linear transformations of the data (Oliver and Baxter, 1994), and unlike ML, 
MML considers the number and precision of a model’s parameters. It has been used in the inference of several probability 
distributions (Wallace, 2005). MML-based inference operates by considering the problem as encoding first the parameter 
estimates and then the data given those estimates. The parameter values that result in the least overall message length to 
explain the whole data are taken as the MML estimates for an inference problem. The MML scheme thus incorporates the 
cost of stating parameters into model selection. It is self evident that a continuous parameter value can only be stated to 
some finite precision; the cost of encoding a parameter is determined by its prior and the precision. ML estimation ignores 
the cost of stating a parameter and MAP based estimation uses the probability density of a parameter instead of its probabil¬ 
ity measure. In contrast, the MML inference process calculates the optimal precision to which parameters should be stated 
and a probability value of the corresponding parameters is then computed. This is used in the computation of the message 
length corresponding to the parameter estimates. Thus, models with varying parameters are evaluated based on their resul¬ 
tant total message lengths. We use this characteristic property of MML to evaluate mixtures containing different numbers of 
components. 

Although there have been several attempts to address the challenges of mixture modelling, the existing methods are 
shown to have some limitations in their formalisms (see Section 7). In particular, some methods based on MML are incom¬ 
plete in their formulation. We aim to rectify these drawbacks by proposing a comprehensive MML formulation and develop 
a search heuristic that selects the number of mixture components based on the proposed formulation. To demonstrate the 
effectiveness of the proposed search and parameter estimation, we first consider modelling problems using Gaussian mix¬ 
tures and extend the work to include relevant discussion on mixture modelling of directional data using von Mises-Fisher 
distributions. 

The importance of Gaussian mixtures in practical applications is well established. For a given number of components, 
the conventional method of estimating the parameters of a mixture relies on the expectation-maximization (EM) algorithm 
(Dempster et ah, 1977). The standard EM is a local optimization method, is sensitive to initialization, and in certain cases 
may converge to the boundary of parameter space (Krishnan and McLachlan, 1997; Figueiredo and Jain, 2002). Previous 
attempts to infer Gaussian mixtures based on the MML framework have been undertaken using simplifying assumptions, 
such as the covariance matrices being diagonal (Oliver et ah, 1996), or coarsely approximating the probabilities of mixture 
parameters (Roberts et ah, 1998; Figueiredo and Jain, 2002). Further, the search heuristic adopted in some of these methods 
is to run the EM several times for different numbers of components, M, and select the M with the best EM outcome (Oliver 
et ah, 1996; Roberts et ah, 1998; Biernacki et ah, 2000). A search method based on iteratively deleting components has been 
proposed by Figueiredo and Jain (2002). It begins by assuming a very large number of components and selectively eliminates 
components deemed redundant; there is no provision for recovering from deleting a component in error. 

In this work, we propose a search method which selectively splits, deletes, or merges components depending on im¬ 
provement to the MML objective function. The operations, combined with EM steps, result in a sensible redistribution of 
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data between the mixture components. As an example, a component may be split into two children, and at a later stage, one 
of the children may be merged with another component. Unlike the method of Figueiredo and Jain (2002), our method starts 
with a one-component mixture and alters the number of components in subsequent iterations. This avoids the overhead of 
dealing with a large number of components unless required. 

The proposed search heuristic can be used with probability distributions for which the MML expressions to calculate 
message lengths for estimates and for data given those estimates are known. As an instance of this. Section 8 discusses 
modelling directional data using the von Mises-Fisher distributions. Directional statistics have garnered support recently 
in real data mining problems where the direction, not magnitude, of vectors is significant. Examples of such scenarios are 
found in earth sciences, meteorology, physics, biology, and other fields (Mardia and Jupp, 2000). The statistical properties 
of directional data have been studied using several types of distributions (Fisher, 1953; Watson and Williams, 1956; Fisher, 
1993; Mardia and Jupp, 2000), often described on surfaces of compact manifolds, such as the sphere, ellipsoid, torus etc. 
The most fundamental of these is the von Mises-Fisher (vMF) distribution which is analogous to a symmetric multivariate 
Gaussian distribution, wrapped around a unit hypersphere (Watson and Williams, 1956). The probability density function of 
a vMF distribution with parameters 0 = (ju, (c) = (mean direction, concentration parameter) for a random unit vector x 6 
on a (<i — 1) - dimensional hypersphere is given by: 

f{x;fi,K)=Cd{K)e’^'^'^^ ( 2 ) 


^d/2-\ 

where Cd(K) = -nx- is the normalization constant and L is a modified Bessel function of the first kind and 

{27Z)d/%^2-l{K) 

order v. 

The estimation of the parameters of the vMF distribution is often done using maximum likelihood. However, the complex 
nature of the mathematical form presents difficulty in estimating the concentration parameter K. This has lead to researchers 
using many different approximations, as discussed in Section 3. Most of these methods perform well when the amount of 
data is large. At smaller sample sizes, they result in inaccurate estimates of K, and are thus unreliable. We demonstrate this 
by the experiments conducted on a range of sample sizes. The problem is particularly evident when the dimensionality of the 
data is large, also affecting the application in which it is used, such as mixture modelling. We aim to rectify this issue by using 
MML estimates for K. Our experiments section demonstrates that the MML estimate of K provides a more reliable answer 
and is an improvement on the current state of the art. These MML estimates are subsequently used in mixture modelling of 
vMF distributions (see Sections 8 and 11). 

Previous studies have established the importance of von Mises circular (two-dimensional) and von Mises-Fisher (three- 
dimensional and higher) mixtures, and demonstrated applications to clustering of protein dihedral angles (Mardia et ak, 
2007; Dowe et al., 1996a), large-scale text clustering (Banerjee et al., 2003), and gene expression analyses (Banerjee et ak, 
2005). The merit of using cosine based similarity metrics, which are closely related to the vMF, for clustering high dimen¬ 
sional text data has been investigated in Strehl et ak (2000). For text clustering, there is evidence that vMF mixture models 
have a superior performance compared to other statistical distributions such as normal, multinomial, and Bernoulli (Salton 
and Buckley, 1988; Salton and McGill, 1986; Zhong and Ghosh, 2003; Banerjee et ak, 2005). movMF is a widely used 
package to perform clustering using vMF distributions (Hornik and Griin, 2013). 


Contributions: The main contributions of this paper are as follows: 

- We derive the analytical estimates of the parameters of a multivariate Gaussian distribution with full covariance matrix, 
using the MML principle (Wallace and Freeman, 1987). 

- We derive the expression to infer the concentration parameter (C of a generic J-dimensional vMF distribution using 
MML-based estimation. We demonstrate, through a series of experiments, that this estimate outperforms the previous 
ones, therefore making it a reliable candidate to be used in mixture modelling. 

- A generalized MML-based search heuristic is proposed to infer the optimal number of mixture components that would 
best explain the observed data; it is based on the search used in various versions of the ’Snob’ classification program (Wal¬ 
lace and Boulton, 1968; Wallace, 1986; Jorgensen and McLachlan, 2008). We compare it with the work of Figueiredo 
and Jain (2002) and demonstrate its effectiveness. 

- The search implements a generic approach to mixture modelling and allows, in this instance, the use of d-dimensional 
Gaussian and vMF distributions under the MML framework. It infers the optimal number of mixture components, and 
their corresponding parameters. 

- Further, we demonstrate the effectiveness of MML mixture modelling through its application to high dimensional text 
clustering and clustering of directional data that arises out of protein conformations. 

The rest of the paper is organized as follows: Sections 2 and 3 describe the respective estimators of Gaussian and 
vMF distributions that are commonly used. Section 4 introduces the MML framework for parameter estimation. Section 5 
outlines the derivation of the MML parameter estimates of multivariate Gaussian and vMF distributions. Section 6 describes 
the formulation of a mixture model using MML and the estimation of the mixture parameters under the framework. Section 7 
reviews the existing methods for selecting the mixture components. Section 8 describes our proposed approach to determine 
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the number of mixture components. Section 9 depicts the competitive performance of the proposed MML-based search 
through experiments conducted with Gaussian mixtures. Section 10 presents the results for MML-based vMF parameter 
estimation and mixture modelling followed by results supporting its applications to text clustering and protein structural data 
in Section 11. 


2 Existing methods of estimating the parameters of a Gaussian distrihution 


The probability density function / of a <f-variate Gaussian distribution is given as 


/(x;m,c) = 


1 


{ 2 k )2 |C| 2 




( 3 ) 


where ju, C are the respective mean, (symmetric) covariance matrix of the distribution, and |C| is the determinant of the 
covariance matrix. The traditional method to estimate the parameters of a Gaussian distribution is by maximum likelihood. 
Given data D = {xi,... ,xn}, where x; 6 the log-likelihood is given by 


C) = -^ log(2;r) - ^ log |C| - ^(x; - /i)^c ' (x; - ju) 


( 4 ) 


To compute the maximum likelihood estimates. Equation (4) needs to be maximized. This is achieved by computing the 
gradient of the log-likelihood function with respect to the parameters and solving the resultant equations. The gradient 
vector of with respect to jti and the gradient matrix of with respect to C are given below. 


P) N 

^.^=^ = EC-'(x,-m) 


( 5 ) 

( 6 ) 


/=1 


The maximum likelihood estimates are then computed by solving V = 0 and = 0 and are given as: 


1 V ^ N 


1=1 


N‘ 


( 7 ) 


i=i 


Cml is known to be a biased estimate of the covariance matrix (Barton, 1961; Basu, 1964; Eaton and Morris, 1970; 
White, 1982) and issues related with its use in mixture modelling have been documented in Gray (1994) and Lo (2011). An 
unbiased estimator of C was proposed by Barton (1961) and is given below. 

1 ^ . 

Gunbiased — 7 (x/ /X ) (x/ /X ) (8) 

tv - 1 

In addition to the maximum likelihood estimates, Bayesian inference of Gaussian parameters involving conjugate priors 
over the parameters has also been dealt with in the literature (Bishop, 2006). However, the unbiased estimate of the covariance 
matrix, as determined by the sample covariance (Equation (8)), is typically used in the analysis of Gaussian distributions. 


3 Existing methods of estimating the parameters of a von Mises-Eisher distrihution 

Eor a von Mises-Eisher (vMF) distribution / characterized by Equation (2), and given data D = {xi,... ,xai}, such that 
x; 6 *, the log-likelihood is given by 


Ji'{D\n,K) = AlogQ(K-) -I- (c/x^R 


( 9 ) 


N 

where N is the sample size and R = ^ x, (the vector sum). Let R denote the magnitude of the resultant vector R and let 

i—l 

fi and k be the maximum likelihood estimators of /x and K respectively. Under the condition that jix is a unit vector, the 
maximum likelihood estimates are obtained by maximizing as follows: 


k = A-UR) where A^jic) = - 

’ Cd{k) N 


. R 

^=R^ 


(10) 
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Solving the non-linear equation; F{k) = Aii{k) — R = 0 yields the corresponding maximum likelihood estimate where 


Ad{K) 


4/2-1 ('^) 


(11) 


represents the ratio of Bessel functions. Because of the difficulties in analytically solving Equation (10), there have been 
several approaches to approximating k (Mardia and Jupp, 2000). Each of these methods is an improvement over their 
respective predecessors. Tanabe et al. (2007) is an improvement over the estimate proposed by Banerjee et al. (2005). Sra 
(2012) is an improvement over Tanabe et al. (2007) and Song et al. (2012) fares better when compared to Sra (2012). The 
methods are summarized below. 


3.1 Banerjee et al. (2005) 


The approximation given by Equation (12) is due to Banerjee et al. (2005) and provides an easy to use expression for k. The 
formula is very appealing as it eliminates the need to evaluate complex Bessel functions. Banerjee et al. (2005) demonstrated 
that this approximation yields better results compared to the ones suggested in Mardia and Jupp (2000). It is an empirical 
approximation which can be used as a starting point in estimating the root of Equation (10). 


Kb 


R{d-R^) 

1-^2 


( 12 ) 


3.2 Tanabe et al. (2007) 


The approximation given by Equation (13) is due to Tanabe et al. (2007). The method utilizes the properties of Bessel 
functions to determine the lower and upper bounds for k and uses a fixed point iteration function in conjunction with linear 
interpolation to approximate k. The bounds for k are given by 


Ki 


R{d-2) 

1-F 


<k<Ku = 


Rd 

l-R^ 


Tanabe et al. (2007) proposed to use a fixed point iteration function defined as (j) 2 d{K} = RKAd(K) ^ and used this to 
approximate k as 

Kl^2d{.Ku) - Ku^2d{Kl) 


Kt = 


- hd{Kl)) - {Ku - Ki) 


(13) 


3.3 Sra (2012) : Truncated Newton approximation 


This a heuristic approximation provided by Sra (2012). It involves refining the approximation given by Banerjee et al. (2005) 
(Equation (12)) by performing two iterations of Newton’s method. Sra (2012) demonstrate that this approximation fares 
well when compared to the approximation proposed by Tanabe et al. (2007). The following two iterations result in K/v, the 
approximation proposed by Sra (2012): 


Ki = Kb- 


F{kb) 

F'(kb) 


and K/v = Ki 


F{Kl) 


where F'{k) =A'^{k) = 1 -Ad{Kf-— — —Ad{K) 

K* 


(14) 

(15) 


3.4 Song et al. (2012) : Truncated Halley approximation 


This approximation provided by Song et al. (2012) uses Halley’s method which is the second order expansion of Taylor’s 
series of a given function F{k). The higher order approximation results in a more accurate estimate as demonstrated by 
Song et al. (2012). The iterative Halley’s method is truncated after iterating through two steps of the root finding algorithm 
(similar to that done by Sra (2012)). The following two iterations result in Kh, the approximation proposed by Song et al. 
( 2012 ): 


2F(kb)F'{kb) 

2F'(kb)^-F{kb)F"(kb) 


and Kh = M 


2F(kx)F'(k,) 

2F'{kiY-F{kx)F"{ki) 


(16) 


where 


F"(K)=J^:i{K)=2Ad(KY + 


3(d- 1) 


Ad{K)^ + 


(d^-d-2KY 


Ad{K) 


(d-l) 


K 


K 


(17) 
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The common theme in all these methods is that they try to approximate the maximum likelihood estimate governed by 
Equation (10). It is to be noted that the maximum likelihood estimators (of concentration parameter (c) have considerable 
bias (Schou, 1978; Best and Fisher, 1981; Cordeiro and Vasconcellos, 1999). To counter this effect, we explore the minimum 
message length based estimation procedure. This Bayesian method of estimation not only results in an unbiased estimate 
but also provides a framework to choose from several competing models (Wallace and Freeman, 1987; Wallace, 1990). 
Through a series of empirical tests, we demonstrate that the MML estimate is more reliable than any of the contemporary 
methods. Dowe et al. (1996c) have demonstrated the superior performance of the MML estimate for a three-dimensional 
vMF distribution. We extend their work to derive the MML estimators for a generic <7-dimensional vMF distribution and 
compare its performance with the existing methods. 


4 Minimum Message Length (MML) Inferenee 

4.1 Model selection using Minimum Message Length 


Wallace and Boulton (1968) developed the first practical criterion for model selection to be based on information theory. 
The resulting framework provides a rigorous means to objectively compare two competing hypotheses and, hence, choose 
the best one. As per Bayes’s theorem, 

Pr(7/&Z)) = Pr{H) x Pr{D\H) = Pr(D) x Pr{H\D) 

where D denotes some observed data, and H some hypothesis about that data. Further, Pr{H &D) is the joint probability of 
data D and hypothesis H, Pr{H) is the prior probability of hypothesis H, Pr(/)) is the prior probability of data D, Pr{H\D) is 
the posterior probability of H given D, and Pr{D\H) is the likelihood. 

MML uses the following result from information theory (Shannon, 1948): given an event or outcome E whose probabil¬ 
ity is Pr(Zi), the length of the optimal lossless code I{E) to represent that event requires I{E) = — log 2 (Pr(£')) bits. Applying 
Shannon’s insight to Bayes’s theorem, Wallace and Boulton (1968) got the following relationship between conditional prob¬ 
abilities in terms of optimal message lengths: 


I{H&D) = 1(H) +I{D\H) = I(D)+I(H\D) 


(18) 


As a result, given two competing hypotheses H and H', 

AI = I{H&D) - I(H'&D) = I{H\D) - I{H'\D) = log 2 bits. 


Pv(H'\D) 

Pr{H\D) 


(19) 


gives the posterior log-odds ratio between the two competing hypotheses. Equation (18) can be intrepreted as the total cost 
to encode a message comprising the hypothesis H and data D. This message is composed over into two parts: 


1. Eirsl part: the hypothesis H, which takes 1(H) bits, 

2. Second part: the observed data D using knowledge of H, which takes I(D\H) bits. 


Clearly, the message length can vary depending on the complexity of H and how well it can explain D. A more complex 
H may fit (i.e., explain) D better but take more bits to be stated itself. The trade-off comes from the fact that (hypothetically) 
transmitting the message requires the encoding of both the hypothesis and the data given the hypothesis, that is, the model 
complexity 1(H) and the goodness of fit I(D\H). 


4.2 Minimum message length based parameter estimation 

Our proposed method of parameter estimation uses the MML inference paradigm. It is a Bayesian method which has been 
applied to infer the parameters of several statistical distributions (Wallace, 2005). We apply it to infer the parameter estimates 
of multivariate Gaussian and vMF distributions. Wallace and Freeman (1987) introduced a generalized scheme to estimate 
a vector of parameters 0 of any distribution / given data D. The method involves choosing a reasonable prior h(0) on the 
hypothesis and evaluating the determinant of the Fisher information matrix \.l^(0)\ of the expected second-order partial 
derivatives of the negative log-likelihood function, —^(D\0). The parameter vector 0 that minimizes the message length 
expression (Equation (20)) is the MML estimate according to Wallace and Freeman (1987). 

1(0,D) = ^ \ogqp - log (^^7==) -^(D\0) + ^ (20) 

1 ( 0 ) 


I(D|0) 
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where p is the number of free parameters in the model, and pp is the lattice quantization constant (Conway and Sloane, 1984) 
in p-dimensional space. The total message length 7(0,D) in MML framework is composed of two parts: 

1. the statement cost of encoding the parameters, 7(0) and 

2. the cost of encoding the data given the parameters, 7(7)|0). 


A concise description of the MML method is presented in Oliver and Baxter (1994). 

We note here a few key differences between MML and ML/MAP based estimation methods. In maximum likelihood 
estimation, the statement cost of parameters is ignored, in effect considered constant, and minimizing the message length 
corresponds to minimizing the negative log-likelihood of the data (the second part). In MAP based estimation, a probability 
density rather than the probability measure is used. Continuous parameters can necessarily only be stated only to finite 
precision. MML incorporates this in the framework by determining the region of uncertainty in which the parameter is 


located. The value of 




p/2 


s /\^{0)\ ^ measure of the volume of the region of uncertainty in which the parameter 0 

is centered. This multiplied by the probability density h{0) gives the probability of a particular 0 and is proportional 
h{@) 

to =. This probability is used to compute the message length associated with encoding the continuous valued 

V 

parameters (to a finite precision). 


5 Derivation of the MML parameter estimates of Gaussian and von Mises-Fisher distributions 

Based on the MML inference process discussed in Section 4, we now proceed to formulate the message length expressions 
and derive the parameter estimates of Gaussian and von Mises-Fisher distributions. 


5.1 MML-based parameter estimation of a multivariate Gaussian distribution 

The MML framework requires the statement of parameters to a finite precision. The optimal precision is related to the Fisher 
information and in conjunction with a reasonable prior, the probability of parameters is computed. 

5.7.7 Prior probability of the parameters 

A flat prior is usually chosen on each of the d dimensions of /i (Roberts et ah, 1998; Oliver et ah, 1996) and a conjugate 
inverted Wishart prior is chosen for the covariance matrix C (Gauvain and Lee, 1994; Agusta and Dowe, 2003; Bishop, 
2006). The joint prior density of the parameters is then given as 

fi(ju,C)c=c icr"^ (21) 


5.7.2 Fisher information of the parameters 

The computation of the Fisher information requires the evaluation of the second order partial derivatives of —.if(D|jU,C). 
Let |,^(jU,C)| represent the determinant of the Fisher information matrix. This is approximated as the product of 
and \^{C)\ (Oliver et ah, 1996; Roberts et ah, 1998), where and are the respective determinants of Fisher 

information matrices due to the parameters fl and C. Differentiating the gradient vector in Equation (5) with respect to fl, 
we have: 

-Vl^ = NC-^ 

Hence, \^{n)\ = N‘‘\C\-^ (22) 

To compute \ j^(C)\, Magnus and Neudecker (1988) derived an analytical expression using the theory of matrix derivatives 
based on matrix vectorization (Dwyer, 1967). Let C = [cy] VI < ij<d where c,y denotes the element corresponding to the 
row and /* column of the covariance matrix. Let v(C) = (cn,... ,ci^;,C 22 , • • • ,C 2 d, ■ ■ - jCdd) be the vector containing the 
d(d+l) 

- - -free parameters that completely describe the symmetric matrix C. Then, the Fisher information due to the vector 

of parameters v(C) is equal to \^(C)\ and is given by Equation (23) (Magnus and Neudecker, 1988; Bozdogan, 1990; Drton 
et ah, 2009). 

|,^(C)| = (23) 

Multiplying Equations (22) and (23), we have 


(24) 
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5.1.3 Message length formulation 

To derive the message length expression to encode data using a certain /X,C, substitute Equations (4), (21), and (24), in 

d(d + 3) 

Equation (20) using the number of free parameters of the distribution as p = ^ ^. Hence, 

('A/' — 1) 1 ^ 

I{ll,C,D) = ——^ log ICI + - ^(x/ - jU)^C“^ (x/ - ju) + constant (25) 

To obtain the MML estimates of jj. and C, Equation (25) needs to be minimized. The MML estimate of ju is same as the 
maximum likelihood estimate (given in Equation (7)). To compute the MML estimate of C, we need to compute the gradient 
matrix of I{fl , C, D) with respect to C and is given by Equation (26) 

Vc/ = ^ - If C-'(x,-/x)(x,-/x)^C-i (26) 

The MML estimate of C is obtained by solving VqI = 0 (given in Equation(27)). 


Vc/ = 0 


Cmml — 


^ i=l 


We observe that the MML estimate Cmml is same as the unbiased estimate of the covariance matrix C, thus, lending 
credibility for its preference over the traditional ML estimate (Equation (7)). 


5.2 MML-based parameter estimation of a von Mises-Fisher distribution 

Parameter estimates for two and three-dimensional vMF have been explored previously (Wallace and Dowe, 1994; Dowe 
et al., 1996b,c). MML estimators of three-dimensional vMF were explored in Dowe et al. (1996c), where they demonstrate 
that the MML-based inference compares favourably against the traditional ML and MAP based estimation methods. We use 
the Wallace and Freeman (1987) method to formulate the objective function (Equation (20)) corresponding to a generic vMF 
distribution. 

5.2.7 Prior probability of the parameters 

Regarding choosing a reasonable prior (in the absence of any supporting evidence) for the parameters 0 = (ju, (c) of a vMF 
distribution, Wallace and Dowe (1994) and Dowe et al. (1996c) suggest the use of the following “colourless prior that is 
uniform in direction, normalizable and locally uniform at the Cartesian origin in tc”: 

h{n,K)^ (28) 

(1 -I- (C2) 2 

5.2.2 Fisher information of the parameters 

Regarding evaluating the Fisher information, Dowe et al. (1996c) argue that in the general (7-dimensional case, 

= {NKAd{K)Y^^ xNAd{K) (29) 

where A^fic) and A'^k) are described by Equations (11) and (15) respectively. 

5.2.3 Message length formulation 

Substituting Equations (9), (28) and (29) in Equation (20) with number of free parameters p = d,'we have the net message 
length expression: 

I{H,k,D) = ^ log '^^'^'^^ +^\ogA'd{K)+ + kY - NlogCd{K) - Kfl'^R + constant (30) 

To obtain the MML estimates of ju and K, Equation (30) needs to be minimized. The estimate for /X is same as the maximum 
likelihood estimate (Equation (10)). The resultant equation in K that needs to be minimized is then given by: 

log -I- \^\ogA'd{K) + log(l + kY — N\ogCd{K) — (c7?-I-constant (31) 

A ^ 2. A 
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To obtain the MML estimate of K, we need to differentiate Equation (31) and set it to zero. 


Let 


G(k) 



jd-l) 

2k 


{d+i)K 
1 + K^ 


id-l)A'jiK) i A'^jK) 
2 A,iK) 2 A'^{k) 


+ NAd{K)-R 


(32) 


The non-linear equation: G{k) =0 does not have a closed form solution. We try both the Newton and Halley’s method to find 
an approximate solution. We discuss both variants and comment on the effects of the two approximations in the experimental 
results. To be fair and consistent with Sra (2012) and Song et al. (2012), we use the initial guess of the root as Kb (Equation 
(12)) and iterate twice to obtain the MML estimate. 

1. Approximation using Newton’s method: 


Ki = Kb- 


G{kb) 

G'{kb) 


and Kmn = M 


G(M) 

g'(m) 


(33) 


2. Approximation using Halley’s method: 


2G[kb)G'{kb) 

2G'{kbY-G{kb)G''{kb) 


and Kmh = M 


2G(m)G'(jci) 

2G'(ici)2-G(ici)G"(ici) 


(34) 


The details of evaluating G'(ic) and G"{k) are discussed in Appendix 13.1. 

Equation (33) gives the MML estimate (Kmn) using Newton’s method and Equation (34) gives the MML estimate (Kmh) 
using Halley’s method. We used these values of MML estimates in mixture modelling using vMF distributions. 


6 Minimum Message Length Approaeh to Mixture Modelling 

Mixture modelling involves representing an observed distribution of data as a weighted sum of individual probability density 
functions. Specifically, the problem we consider here is to model the mixture distribution ^ as defined in Equation (1). 
Eor some observed data D = {xi ,.. .,xsi} (N is the sample size), and a mixture ,/#, the log-likelihood using the mixture 
distribution is as follows: 

N M 

^(D\^) = Y^\ogY^Wjfj{xr,©j) (35) 

i=l >=1 

where ^ = {wi, • • • , wm,©!, • • • ,©m}, wj and fj{x\©j) are the weight and probability density of the f' component respec¬ 
tively. Eor a fixed M, the mixture parameters 4> are traditionally estimated using a standard expectation-maximization(EM) 
algorithm (Dempster et al., 1977; Krishnan and McLachlan, 1997). This is briefly discussed below. 


6.1 Standard EM algorithm to estimate mixture parameters 


The standard EM algorithm is based on maximizing the log-likelihood function of the data (Equation (35)). The maximum 

likelihood estimates are then given as ^ml = argmax Because of the absence of a closed form solution for ^ml, 

•P 

a gradient descent method is employed where the parameter estimates are iteratively updated until convergence to some local 
optimum is achieved (Dempster et al., 1977; McLachlan and Basford, 1988; Xu and Jordan, 1996; Krishnan and McLachlan, 
1997; McLachlan and Peel, 2000). The EM method consists of two steps: 

- E-step: Each datum x, has fractional membership to each of the mixture components. These partial memberships of the 
data points to each of the components are defined using the responsibility matrix 


r 


U ~ 


Wjf{xi;@j) 

If=iw,/(x,■;©,)’ 


VI <f<A^,l <i<M 


(36) 


where r,y denotes the conditional probability of a datum x,- belonging to the component. The effective membership 
associated with each component is then given by 


N M 

nj = Y^rij and 

1=1 j=i 


- M-step: Assuming be the estimates at some iteration t, the expectation of the log-likelihood using and the 
partial memberships is then maximized which is tantamount to computing the updated maximum likelihood 

n^‘^ 

estimates for the next iteration (t -|- 1). The weights are updated as Wj = -A—. 
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The above sequence of steps are repeated until a certain convergence criterion is satisfied. At some intermediate iteration t, 
the mixture parameters are updated using the corresponding ML estimates and are given below. 

- Gaussian: The ML updates of the mean and covariance matrix are 




^ E '-Jx/ and = 4) E '■? 4 - ' 


n. 


nj 


- von Mises-Fisher: The resultant vector sum is updated as Ry = E represents the magnitude of vector 




,(f+i) 


then the updated mean and concentration parameter are 


/=1 


ih+i) _ 


h+i) 


nh+l) _ 

(r+1) ’ 


R 


R 


(f+i) 


.(0 




6.2 EM algorithm to estimate mixture parameters using MML 

We will first describe the methodology involved in formulating the MML-based objective function. We will then discuss how 
EM is applied in this context. 


6.2.1 Encoding a mixture model using MML 


We refer to the discussion in Wallace (2005) to briefly describe the intuition behind mixture modelling using MML. Encoding 
of a message using MML requires the encoding of (1) the model parameters and then (2) the data using the parameters. The 
statement costs for encoding the mixture model and the data can be decomposed into: 

1. Encoding the number of components M: In order to encode the message losslessly, it is required to initially state the 
number of components. In the absence of background knowledge, one would like to model the prior belief in such a way 
that the probability decreases for increasing number of components. If h{M) cx2-", then I{M) = A/log2 + constant. The 
prior reflects that there is a difference of one bit in encoding the numbers M and M +1. Alternatively, one could assume 
a uniform prior over M within some predefined range. The chosen prior has little effect as its contribution is minimal 
when compared to the magnitude of the total message length (Wallace, 2005). 

2. Encoding the weights wi, • • • ,wm which are treated as parameters of a multinomial distribution with sample size ny, VI < 
j < M. The length of encoding all the weights is then given by the expression (Boulton and Wallace, 1969): 


/(w) 


{M-i) 

2 


log A- 


2 M 

- ElogWj- 
>=1 


(M- 1)! 


(38) 


3. 

4. 


Encoding each of the component parameters ©j as given by I{©j) 


-log 


m) 

vWm 


(discussed in Section 4.2). 


Encoding the data: each datum x,- can be stated to a finite precision which is dictated by the accuracy of measurement*. 
If the precision to which each element of a d-dimensional vector can be stated is e, then the probability of a datum 
X; 6 R'* is given as Pr(x,) = e‘*Pr(x,j,.#) where Pr(x,j,/#) is the probability density given by Equation (1). Hence, the 
total length of its encoding is given by 


^(Xi) 


-logPr(x,) 


—dXoge 


M 

logE'^f//(x/|0j) 


1=1 


The entire data D can now be encoded as: 


/(D|4>) 


N M 

—Ndloge — 

'=1 1=1 


(39) 


(40) 


^ We note that e is a constant value and has no effect on the overall inference process. It is used in order to maintain the theoretical validity when 
making the distinction between probability and probability density. 
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Thus, the total message length of a M component mixture is given by Equation (41). 

M 

I{4>,D) = I{M) +/(w) + ^ /(0j) +I{D\4>) +constant 
>=i 

( M M \ 

— ^ log/i(0j) + - ^ log |^(0j)| ]+/(D|^) + constant (41) 

J=1 ^7=1 / 

Note that the constant term includes the lattice quantization constant (resulting from stating all the model parameters) in 
a p-dimensional space, where p is equal to the number of free parameters in the mixture model. 


6.2.2 Estimating the mixture parameters 


The parameters of the mixture model are those that minimize Equation (41). To achieve this we use the standard EM algorithm 
(Section 6.1), where, iteratively, the parameters are updated using their respective MML estimates. The component weights 
are obtained by differentiating Equation (41) with respect to wj under the constraint Ljli 't'y = 1 • The derivation of the MML 
updates of the weights is shown in Appendix 13.2 and are given as: 


„('+!) 


^ 2 

iV+f 


(42) 


The parameters of the y* component are updated using rf) and n^p (Equations (36), (37)), the partial memberships assigned 
to the y* component at some intermediate iteration t and and are given below. 

- Gaussian: The MML updates of the mean and covariance matrix are 




(^+1) _ 


1 yv 1 iv T 


- von Mises-Fisher: The resultant vector sum is updated as Ry = T,f=i If represents the magnitude of 

vector then the updated mean is given by Equation (44). 


M 


h+i) _ 


h+i) 




(44) 


The MML update of the concentration parameter is obtained by solving = 0 after substituting N nP 

and R —> in Equation (32). 

The EM is terminated when the change in the total message length (improvement rate) between successive iterations is 
less than some predefined threshold. The difference between the two variants of standard EM discussed above is firstly the 
objective function that is being optimized. In Section 6.1, the log-likelihood function is maximized which corresponds to 
I{D\^) term in Section 6.2. Equation (41) includes additional terms that correspond to the cost associated with stating the 
mixture parameters. Secondly, in the M-step, in Section 6.1, the components are updated using their ML estimates whereas 
in Section 6.2, the components are updated using their MML estimates. 


6.3 Issues arising from the use of EM algorithm 

The standard EM algorithms outlined above can be used only when the number of mixture components M is fixed or 
known a priori. Even when the number of components are fixed, EM has potential pitfalls. The method is sensitive to the 
initialization conditions. To overcome this, some reasonable start state for the EM may be determined by initially clustering 
the data (Krishnan and McLachlan, 1997; McLachlan and Peel, 2000). Another strategy is to run the EM a few times and 
choose the best amongst all the trials. Eigueiredo and Jain (2002) point out that, in the case of Gaussian mixture modelling, 
EM can converge to the boundary of the parameter space when the corresponding covariance matrix is nearly singular or 
when there are few initial members assigned to that component. 
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7 Existing methods of inferring the number of mixture eomponents 

Inferring the “right” number of mixture components for unlabelled data has proven to be a thorny issue (McLachlan and 
Peel, 2000) and there have been numerous approaches proposed that attempt to tackle this problem (Akaike, 1974; Schwarz 
et al., 1978; Rissanen, 1978; Bozdogan, 1993; Oliver et al., 1996; Roberts et al., 1998; Biernacki et al., 2000; Figueiredo 
and Jain, 2002). Given some observed data, there are infinitely many mixtures that one can fit to the data. Any method 
that aims to selectively determine the optimal number of components should be able to factor the cost associated with the 
mixture parameters. To this end, several methods based on information theory have been proposed where there is some form 
of penalty associated with choosing a certain parameter value (Wallace and Boulton, 1968; Akaike, 1974; Schwarz et al., 
1978; Wallace and Freeman, 1987; Rissanen, 1989). We briefly review some of these methods and discuss the state of the 
art and then proceed to explain our proposed method. 


7.1 AIC (Akaike, 1974) & BIC (Schwarz et al., 1978) 

AIC in the simplest form adds the number of free parameters p to the negative log-likelihood expression. There are some 
variants of AIC suggested (Bozdogan, 1983; Burnham and Anderson, 2002). However, these variants introduce the same 
penalty constants for each additional parameter: 

AlC{p)=p-^{D\^) 

BIC, similar to AIC, adds a constant multiple of | log A (N being the sample size), for each free parameter in the model. 

BIC(p) = ^\o^N - ^(D\^) 

Rissanen (1978) formulated minimum description length (MDL) which formally coincides with BIC (Oliver et al., 1996; 
Figueiredo and Jain, 2002). 

7.1.1 Formulation of the scoring functions 

AIC and BIC/MDL serve as scoring functions to evaluate a model and its corresponding fit to the data. The formulations 
suggest that the parameter cost associated with adopting a model is dependent only on the number of free parameters and 
not on the parameter values themselves. In other words, the criteria consider all models of a particular type (of probability 
distribution) to have the same statement cost associated with the parameters. For example, a generic d-dimensional Gaussian 
distribution has p = free parameters. All such distributions will have the same parameter costs regardless of their 

characterizing means and covariance matrices, which is an oversimplifying assumption which can hinder proper inference. 

The criteria can be interpreted under the MML framework wherein the first part of the message is a constant multiplied 
by the number of free parameters. AIC and BIC formulations can be obtained as approximations to the two-part MML for¬ 
mulation governed by Equation (20) (Figueiredo and Jain, 2002). It has been argued that for tasks such as mixture modelling, 
where the number of free parameters potentially grows in proportion to the data, MML is known in theory to give consistent 
results as compared to AIC and BIC (Wallace, 1986; Wallace and Dowe, 1999). 

7.1.2 Search method to determine the optimal number of mixture components 

To determine the optimal number of mixture components M, the AIC or BIC scores are computed for mixtures with varying 
values of M. The mixture model with the least score is selected as per these criteria. 

A (i-variate Gaussian mixture with M number of components has p = g.gg parameters. All mix¬ 

tures with a set number of components have the same cost associated with their parameters using these criteria. The mixture 
complexity is therefore treated as independent of the constituent mixture parameters. In contrast, the MML formulation 
incorporates the statement cost of losslessly encoding mixture parameters by calculating their relevant probabilities as dis¬ 
cussed in Section 6. 


7.2 MML Unsupervised (Oliver et al., 1996) 

7.2.1 Formulation of the scoring function 

A MML-based scoring function akin to the one shown in Equation (41) was used to model Gaussian mixtures. However, the 
authors only consider the specific case of Gaussians with diagonal covariance matrices, and fail to provide a general method 
dealing with full covariance matrices. 
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7.2.2 Search method to determine the optimal number of mixture components 

A rigorous treatment on the selection of number of mixture components M is lacking. Oliver et al. (1996) experiment with 
different values of M and choose the one which results in the minimum message length. For each M, the standard EM 
algorithm (Section 6.1) was used to attain local convergence. 


7.3 Approximate Bayesian (Roberts et al., 1998) 

The method, also referred to as Laplace-empirical criterion (EEC) (McLachlan and Peel, 2000), uses a scoring function 
derived using Bayesian inference and serves to provide a tradeoff between model complexity and the quality of fit. The 
parameter estimates are those that result in the minimum value of the following scoring function. 

- logP(Z),^) = +Md\og(2a^al) - log(M- 1)! - y log(2;r) + ^ log (45) 

where D is the dataset, (75| 4>) is the negative log-likelihood given the mixture parameters 4>, M is the number of mixture 
components, d the dimensionality of the data, a,l5 are hyperparameters (which are set to 1 in their experiments). Op is a pre¬ 
defined constant or is pre-computed using the entire data, H{^) is the Hessian matrix which is equivalent to the empirical 
Fisher matrix for the set of component parameters, and p is the number of free parameters in the model. 


7.3.1 Formulation of the scoring function 


The formulation in Equation (45) can be obtained as an approximation to the message length expression in Equation (41) by 
identifying the following related terms in both equations. 


1. /(w) —5-— log(M—1)! 

2. For a d-variate Gaussian with mean ju and covariance matrix C, the joint prior h{fl,C) is calculated as follows: 

- Prior on fl: Each of the d parameters of the mean direction are assumed to be have uniform priors in the range 

{—aOp,aOp), so that the prior density of the mean is h{fl) = —- - 

yZCCGp ) 

- Prior on C.' It is assumed that the prior density is dependent only on the diagonal elements in C. Each diagonal 
covariance element is assumed to have a prior in the range {0,j5Op) so that the prior on C is considered to be 

h{C) = ^ 


{^OpY 

The joint prior, is therefore, assumed to be h{fl,C) = 
Thus, - I" 1 log h{0j) Mdlog(2a/3(j2) 


1 


(lapajy 


3. 


2lf=ilog|=^(0;)|^ilog \H\ 


4. constant —> | log(2;r) 

Although the formulation is an improvement over the previously discussed methods, there are some limitations due to the 
assumptions made while proposing the scoring function: 


- While computing the prior density of the covariance matrix, the off-diagonal elements are ignored. 

- The computation of the determinant of the Fisher matrix is approximated by computing the Hessain |//|. It is to be noted 
that while the Hessian is the observed information (data dependent), the Fisher information is the expectation of the 
observed information. MML formulation requires the use of the expected value. 

- Further, the approximated Hessian was derived for Gaussians with diagonal covariances. For Gaussians with full covari¬ 
ance matrices, the Hessian was approximated by replacing the diagonal elements with the corresponding eigen values 
in the Hessian expression. The empirical Fisher computed in this form does not guarantee the charactersitic invariance 
property of the classic MML method (Oliver and Baxter, 1994). 


7.3.2 Search method to determine the optimal number of mixture components 

The search method used to select the optimal number of components is rudimentary. The optimal number of mixture com¬ 
ponents is chosen by running the EM 10 times for every value of M within a given range. An optimal M is selected as the 
one for which the best of the 10 trials results in the least value of the scoring function. 
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7.4 Integrated Complete Likelihood (ICL) (Biernacki et al., 2000) 

The ICL criterion maximizes the complete log-likelihood (CL) given by 

N M 

CL(D, 4>) = ^(/)|d>) - ^ Zijlognj 

i=lj=l 

where ^{D\^) is the log-likelihood (Equation (35)), is the responsibility term (Equation (36)), and Zij = 1 if x,- arises 

N M 

from component j and zero otherwise. The term EE Zij log Vij is explained as the estimated mean entropy. 

i=l j=l 


7.4.1 Formulation of the scoring function 

The ICL criterion is then defined as: ICL{^,M) = CL{^) — ^ logA^,where p is the number of free parameters in the model. 
We observe that similar to BIC, the ICL scoring function penalizes each free parameter by a constant value and does not 
account for the model parameters. 

7.4.2 Search method to determine the optimal number of mixture components 

The search method adopted in this work is similar to the one used by (Roberts et al., 1998). The EM algorithm is initiated 
20 times for each value of M with random starting points and the best amongst those is chosen. 


7.5 Unsupervised Learning of Finite Mixtures (Figueiredo and Jain, 2002) 


The method uses the MML criterion to formulate the scoring function given by Equation (46). The formulation can be 
intrepreted as a two-part message for encoding the model parameters and the observed data. 


N, 


M 


( Nw, 


/(A4i) = yElog(^ —j+-log- + 


M. 


N M{Np + l) 


second part 


(46) 


first part 


where Np is the number of free parameters per component and wj is the component weight. 
7.5.1 Formulation of the scoring function 


The scoring function is derived from Equation (41) by assuming the prior density of the component parameters to be a 
Jeffreys prior. If 0j is the vector of parameters describing the y* component, then the prior density h{Qj) cx y/JW{©j)\ 
(Jeffreys, 1946). Similarly, a prior for weights would result in h(wi,... ,wm) {wi.. These assumptions are used 

in the encoding of the parameters which correspond to the first part of the message. 

We note that the scoring function is consistent with the MML scheme of encoding parameters and the data using those 
parameters. However, the formulation can be improved by amending the assumptions as detailed in in Section 5. Further, the 
assumptions made in Figueiredo and Jain (2002) have the following side effects: 


- The value of — log 


hjOj) 

VWW]) 


gives the cost of encoding the component parameters. By assuming h[@j) cx yl\W{0j) 


the message length associated with using any vector of parameters 0j is essentially treated the same. To avoid this, the 
use of independent uniform priors over non-informative Jeffreys’s priors was advocated previously (Oliver et al., 1996; 
Lee, 1997; Roberts et al., 1998). The use of Jeffreys prior offers certain advantages, for example, not having to compute 
the Fisher information (Jeffreys, 1946). However, this is crucial and cannot be ignored as it dictates the precision of 
encoding the parameter vector. Wallace (2005) states that “Jeffreys, while noting the interesting properties of the prior 
formulation did not advocate its use as a genuine expression of prior knowledge.” 

By making this assumption, Figueiredo and Jain (2002) “sidestep"’ the difficulty associated with explicitly computing 
the Fisher information associated with the component parameters. Hence, for encoding the parameters of the entire 
mixture, only the cost associated with encoding the component weights is considered. 

- The code length to state each 0y is, therefore, greatly simplified as {Np/2)\og{Nwj) (notice the sole dependence on 
weight Wj). Figueiredo and Jain (2002) interpret this as being similar to a MDL formulation because Nwj gives the 
expected number of data points generated by the y* component. This is equivalent to the BIC criterion discussed earlier. 
We note that MDL/BIC are highly simplified versions of MML formulation and therefore. Equation (46) does not capture 
the entire essence of complexity and goodness of fit accurately. 
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7.5.2 Search method to determine the optimal number of mixture components 

The method begins by assuming a large number of components and updates the weights iteratively in the EM steps as 

max|o,n;— 

= VM /n - k\ 

where nj is the effective membership of data points in 7 * component (Equation (37)). A component is annihilated when 
its weight becomes zero and consequently the number of mixture components decreases. We note that the search method 
proposed by Figueiredo and Jain (2002) using the MML criterion is an improvement over the methods they compare against. 
However, we make the following remarks about their search method. 

- The method updates the weights as given by Equation (47). During any iteration, if the amount of data allocated to a 
component is less than Np /2, its weight is updated as zero and this component is ignored in subsequent iterations. This 
imposes a lower bound on the amount of data that can be assigned to each component. As an example, for a Gaussian 
mixture in 10-dimensions, the number of free parameters per component is Np = 65, and hence the lower bound is 33. 
Hence, in this exampe, if a component has 30 data, the mixture size is reduced and these data are assigned to some 
other component(s). Consider a scenario where there are 50 observed 10 dimensional data points originally generated 
by a mixture with two components with equal mixing proportions. The method would always infer that there is only one 
component regardless of the separation between the two components. This is clearly a wrong inference! (see Section 9.4 
for the relevant experiments). 

- Once a component is discarded, the mixture size decreases by one, and it cannot be recovered. Because the memberships 
nj are updated iteratively using an EM algorithm and because EM might not always lead to global optimum, it is 
conceivable that the updated values need not always be optimal. This might lead to situations where a component is 
deleted owing to its low prominence. There is no provision to increase the mixture size in the subsequent stages of the 
algorithm to account for such behaviour. 

- The method assumes a large number of initial components in an attempt to be robust with respect to EM initialization. 
However, this places a significant overhead on the computation due to handling several components. 

Summary: We observe that while all these methods (and many more) work well within their defined scope, they are incom¬ 
plete in achieving the true objective that is to rigorously score models and their ability to fit the data. The methods discussed 
above can be seen as different approximations to the MML framework. They adopted various simplifying assumptions and 
approximations. To avoid such limitations, we developed a classic MML formulation, giving the complete message length 
formulations for Gaussian and von Mises-Fisher distributions in Section 5. 

Secondly, in most of these methods, the search for the optimal number of mixture components is achieved by selecting 
the mixture that results in the best EM outcome out of many trials (Akaike, 1974; Schwarz et al., 1978; Oliver et ak, 1996; 
Roberts et ak, 1998; Biernacki et ak, 2000). This is not an elegant solution and Figueiredo and Jain (2002) proposed a search 
heuristic which integrates estimation and model selection. A comparative study of these methods is presented in McLachlan 
and Peel (2000). Their analysis suggested the superior performance of ICE (Biernacki et ak, 2000) and EEC (Roberts et ak, 
1998). Later, Figueiredo and Jain (2002) demonstrated that their proposed method outperforms the contemporary methods 
based on ICE and EEC and is regarded as the current state of the art. We, therefore, compare our method against that of 
Figueiredo and Jain (2002) and demonstrate its effectiveness. 

With this background, we formulate an alternate search heuristic to infer the optimal number of mixture components 
which aims to address the above limitations. 


8 Proposed approach to infer an optimal mixture 

The space of candidate mixture models to explain the given data is infinitely large. As per the MML criterion (Equation (41)), 
the goal is to search for the mixture that has the smallest overall message length. We have seen in Section 6.2 that if the 
number of mixture components are fixed, then the EM algorithm can be used to estimate the mixture parameters, namely the 
component weights and the parameters of each component. However, here it is required to search for the optimal number of 
mixture components along with the corresponding mixture parameters. 

Our proposed search heuristic extends the MML-based Snob program (Wallace and Boulton, 1968; Wallace, 1986; 
Jorgensen and McLachlan, 2008) for unsupervised learning. We define three operations, namely split, delete, and merge that 
can be applied to any component in the mixture. 
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Algorithm 1: Achieve an optimal mixture model 


current ■<— one-component-mixture 
while true do 

components ^ current mixture components 
M ^ number ofcomponents 

for \ to M do 

I splits[i] •<— Sp\it{current,components[i]) 

BestSplit ^ best(splits) 

if M > 1 then 

for \ to M do 

I deletes[i\ ^'D^Xoi^icurrent^ components[i]) 

BestDelete best(deletes) 
for \ to M do 

j ^ closest-component(i) 
merges[i] ^Merge(currentd,j) 

BestMerge best(merges) 

Be St Perturbation ■<— best(BestSplit.BestDelete.BestMerge) 

A1 ^ mQ's,?, 2 LgidAQOgi\\{BestPerturbation) — message_length(cwrr^rt?) 
if A/ < 0 then 

current ^ Best Perturbation 
continue 
else 

I break 


2 

3 

4 

5 

6 

7 

8 
9 

10 
11 
12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 

23 return current 


/* exhaustively split all components */ 

/* remember the best split */ 

/* exhaustively delete all components */ 

/* remember the best deletion */ 
/* exhaustively merge all components */ 

/* remember the best merge */ 
/* select the best perturbation */ 
/* check for improvement */ 


8.1 The complete algorithm 

The pseudocode of our search method is presented in Algorithm 1. The basic idea behind the search strategy is to perturb a 
mixture from its current suboptimal state to obtain a new state (if the perturbed mixture results in a smaller message length). 
In general, if a (current) mixture has M components, it is perturbed using a series of Split, Delete, and Merge operations 
to check for improvement. Each component is split and the new (M+ 1)-component mixture is re-estimated. If there is an 
improvement {i.e., if there is a decrease in message length with respect to the current mixture), the new (M+ l)-component 
mixture is retained. There are M splits possible and the one that results in the greatest improvement is recorded (lines 5 - 
7 in Algorithm 1). A component is first split into two sub-components (children) which are locally optimized by the EM 
algorithm on the data that belongs to that sole component. The child components are then integrated with the others and 
the mixture is then optimized to generate a M-h 1 component mixture. The reason for this is, rather than use random initial 
values for the EM, it is better if we start from some already optimized state to reach to a better state. Similarly, each of the 
components is then deleted, one after the other, and the (M— l)-component mixture is compared against the current mixture. 
There are M possible deletions and the best amongst these is recorded (lines 8 - 11 in Algorithm 1). Finally, the components 
in the current mixture are merged with their closest matches (determined by calculating the KL-divergence) and each of 
the resultant {M — 1)-component mixtures are evaluated against the M component mixture. The best among these merged 
mixtures is then retained (lines 12 - 15 in Algorithm 1). 

We initially start by assuming a one component mixture. This component is split into two children which are locally op¬ 
timized. If the split results in a better model, it is retained. For any given M-component mixture, there might be improvement 
due to splitting, deleting and/or merging its components. We select the perturbation that best improves the current mixture. 
This process is repeated until there is no further improvement possible, and the algorithm is continued. The notion of best or 
improved mixture is based on the amount of reduction of message length that the perturbed mixture provides. In the current 
state, the observed data have partial memberships in each of the M components. Before the execution of each operation, these 
memberships need to be adjusted and a EM is subsequently carried out to achieve an optimum with a different number of 
components. We will now examine each operation in detail and see how the memberships are affected after each operation. 


8.2 Strategic operations employed to determine an optimal mixture model 

Let R = [r/y] be the N x M responsibility (membership) matrix and wj be the weight of component in mixture 

1. Split (Line 6 in Algorithm 1): As an example, assume a component with index a 6 {^^M} and weight Wa in the current 
mixture ^ is split to generate two child components. The goal is to find two distinct clusters amongst the data associated 
with component a. It is to be noted that the data have fractional memberships to component a. The EM is therefore, 
carried out within the component a assuming a two-component sub-mixture with the data weighted as per their current 
memberships r/ct- The remaining M— 1 components are untouched. An EM is carried out to optimize the two-component 
sub-mixture. The initial state and the subsequent updates in the Maximization-step are described below. 
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Parameter initialization of the two-component sub-mixture: The goal is to identify two distinct clusters within the com¬ 
ponent a. For Gaussian mixtures, to provide a reasonable starting point, we compute the direction of maximum variance 
of the parent component and locate two points which are one standard deviation away on either side of its mean (along 
this direction). These points serve as the initial means for the two children generated due to splitting the parent compo¬ 
nent. Selecting the initial means in this manner ensures they are reasonably apart from each other and serves as a good 
starting point for optimizing the two-component sub-mixture. The memberships are initialized by allocating the data 
points to the closest of the two means. Once the means and the memberships are initialized, the covariance matrices of 
the two child components are computed. 

There are conceivably several variations to how the two-component sub-mixture can be initialized. These include 
random initialization, selecting two data points as the initial component means, and many others. However, the reason 
for selecting the direction of maximum variance is to utilize the available characteristic of data, i.e., the distribution 
within the component a. For von Mises-Fisher mixtures, the maximum variance strategy (as for Gaussian mixtures) 
cannot be easily adopted, as the data is distributed on the hypershpere. Hence, in this work, we randomly allocate data 
memberships and compute the components’ (initial) parameters. 

Once the parameters of the sub-mixture are initialized, an EM algorithm is carried out (just for the sub-mixture) 
with the following Maximization-step updates. Let be the N x2 responsibility matrix for the two-component 

sub-mixture. For k 6 {1,2}, let n^'^ be the effective memberships of data belonging to the two child components, let 

(k) Ik) 

Wa be the weights of the child components within the sub-mixture, and let 0„ be the parameters describing the child 

components. 

- The effective memberships are updated as given by Equation (48). 

^ = L and = N (48) 

/=! 


- As the sub-mixture comprises of two child components, substitute M = 2 in Equation (42) to obtain the updates for 
the weights. These are given by Equation (49). 


(k) I 1 

(k) na + 2 . (1) , (2) , 

Wa = and wy+w^a 


(49) 


- For Gaussian mixtures, the component parameters ©a ^ = are updated as follows; 


N 

52 k’iat'iipi-i _ 

- and = — 




Y,riarl{xi- (L 

E - 1 




(50) 


- For von Mises-Fisher mixtures, the component parameters ©a ^ = are updated as follows: 


= where = £ ri„4xi 


R 


i=l 


(51) 


(k) (k) r^(k) 

Ra represents the magnitude of vector • The update of the concentration parameter Ka is obtained by solving 

G{Ka'’) = 0 after substituting N r,a4 and R —> Ra'^ in Equation (32). 

The difference between the EM updates in Equations (43), (44) and Equations (50), (51) is the presence of the 
coefficient r,a4 with each x,. Since we are considering the sub-mixture, the original responsibility is multiplied by 
the responsibility within the sub-mixture 4 to quantify the influence of datum x,- to each of the child components. 

After the sub-mixture is locally optimized, it is integrated with the untouched M — I components of ^ to result 
in a M-|- 1 component mixture . An EM is finally carried out on the combined M 1 components to estimate the 
parameters of and result in an optimized (M+ 1)-component mixture as follows. 


EM initialization for : Usually, the EM is started by a random initialization of the members. However, because the 
two-component sub-mixture is now optimal and the M — 1 components in ^ are also in an optimal state, we exploit this 
situation to initialize the EM (for ^') with a reasonable starting point. As mentioned above, the component with index 
a with component weight Wa is split. Upon integration, the (child) components that replaced component a will now 
correspond to indices a and a -|- 1 in the new mixture Let R' = VI < i < A, 1 < } < M-|- 1 be the responsibility 
matrix for the new mixture and let Wj be the component weights in 
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- Component weights: The weights are initialized as follows: 

w'j = Wj if j < cc 

! W A ' ( 2 ) 

w„ = WaWa and Wa+i = WaWa 

w'j = wj-i if y>a+l (52) 

- Memberships: The responsibility matrix R' is initialized for all data x, Vl < i < iV as follows: 

r'ij = nj if j<a 
r'ia = nara and 

r'ij = rij-l if j>a + l 

N 

and w'- = ^ V1 < _/ < M + 1 (53) 

/—I 

where n'- are the effective memberships of the components in . 

With these starting points, the parameters of are estimated using the traditional EM algorithm with updates in 
the Maximization-step given by Equations (42), (43), and (44). The EM results in local convergence of the (M + 1)- 
component mixture. If the resultant message length of encoding data using is lower than that due to that means 
the perturbation of ^ because of splitting component a resulted in a new mixture that compresses the data better, 
and hence, is a better mixture model to explain the data. 

2. Delete (Line 10 in Algorithm 1): The goal here is to remove a component from the current mixture and check whether it 
results in a better mixture model to explain the observed data. Assume the component with index a and the correspond¬ 
ing weight Wa is to be deleted from ^ to generate a M — 1 component mixture Once deleted, the data memberships 
of the component need to be redistributed between the remaining components. The redistribution of data results in a 
good starting point to employ the EM algorithm to estimate the parameters of as follows. 


EM initialization for : Let = [r|y] be the A x (M — 1) responsibility matrix for the new mixture and let w'j be 
the weight of y* component in . 

- Component weights: The weights are initialized as follows: 


w' = 


! 

Wj = 


l-Wa 

'^j+l 

\-Wa 


if i<a 
if i>a 


(54) 


It is to be noted that Wa 1 because the MML update expression in the M-step for the component weights always 
ensures non-zero weights during every iteration of the EM algorithm (see Equation (42)). 

- Memberships: The responsibility matrix R' is initialized for all data x, Vl < i < A as follows: 


and 


! _ ^»(t'+l) 
I-La 


if j<a 
if j>a 


N 



<j<M-l 


(55) 


where «'■ are the effective memberships of the components in . It is possible for a datum x,- to have complete 
membership in component a (i.e., ria = 1), in which case, its membership is equally distributed among the other 

M — \ components (i.e., rj- =-, V_/ e {1,M— 1}). 

With these readjusted weights and memberships, and the constituent M— 1 components, the traditional EM algorithm 
is used to estimate the parameters of the new mixture . If the resultant message length of encoding data using is 
lower than that due to that means the perturbation of ^ because of deleting component a resulted in a new mixture 
with better explanatory power, which is an improvement over the current mixture. 

3. Merge (Line 14 in Algorithm 1): The idea is to join a pair of components of and determine whether the resulting 
(M— 1)-component mixture is any better than the current mixture One strategy to identify an improved mixture 
model would be to consider merging all possible pairs of components and choose the one which results in the greatest 
improvement. This would, however, lead to a runtime complexity of 0{M^), which could be significant for large values 
of M. Another strategy is to consider merging components which are “close” to each other. Eor a given component, we 
identify its closest component by computing the Kullback-Leibler (KL) distance with all others and selecting the one 
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with the least value. This would result in a linear runtime complexity of 0{M) as computation of KL-divergence is a 
constant time operation. For every component in its closest match is identified and they are merged to obtain a M — 1 
component mixture Merging the pair involves reassigning the component weights and the memberships. An EM 
algorithm is then employed to optimize 

Assume components with indices a and jS are merged. Let their weights be Wa and wp ; and their responsibility 
terms be ria and , 1 < i < respectively. The component that is formed by merging the pair is determined first. It is 
then integrated with the M— 2 remaining components of ^ to produce a {M — 1)-component mixture 

EM initialization for : Let and be the weight and responsibility vector of the merged component m respec¬ 

tively. They are given as follows: 

=Wa + Wp 

= na + ni3A<i<N (56) 

The parameters of this merged component are estimated as follows: 

- Gaussian: The parameters are: 


H = - and = ^ 


N 

I 

/=1 


(57) 




- von Mises-Fisher: The parameters 


At''"' = 


R(" 

W’ 


where = E 

i=l 


(58) 


The concentration parameter is obtained by solving = 0 after substituting N —)■ and R —>■ 

in Equation (32). 

The merged component m with weight responsibility vector and parameters ©('") is then integrated with 
the M — 2 components. The merged component and its associated memberships along with the M — 2 other components 
serve as the starting point for optimizing the new mixture . If results in a lower message length compared to ^ 
that means the perturbation of ^ because of merging the pair of components resulted in an improvement to the current 
mixture. 


8.3 Illustrative example of our search procedure 


We explain the proposed inference of mixture components through the fol¬ 
lowing example that was also considered by Figueiredo and Jain (2002). 

Consider a bivariate Gaussian mixture shown in Fig. 1. The mixture has 
three components with equal weights of 1/3 each and their means at (-2,0), 

(0,0), and (2,0). The covariance matrices of the three components are the 
same and are equal to diag{2,0.2}. We simulate 900 data points from this 
mixture (as done by Figueiredo and Jain (2002)) and employ the proposed 
search strategy. The progression of the search method using various opera¬ 
tions is detailed below. 

Search for the optimal mixture model: The method begins by inferring a 
one-component mixture Pi (see Fig. 2(a)). It then splits this component (as 
described in Split step of Section 8.2) and checks whether there is an im¬ 
provement in explanation. The red ellipse in Fig 2(b) depicts the component 
being split. The direction of maximum variance (dotted black line) is first 
identified, and the means (shown by black dots at the end of the dotted line) 
are initialized. An EM algorithm is then used to optimize the two children 
and this results in a mixture P 2 shown in Eig 2(c). Since the new mixture has 
a lower message length, the current is updated as P 2 . 

In the second iteration, each component in P 2 is iteratively split, deleted, 
and merged. Fig. 3 shows the splitting (red) of the first component. On splitting, the new mixture P 3 results in a lower 
message length. Deletion of the first component is shown in Fig. 4. Before merging the first component, we identify its 



Fig. 1 Original mixture consisting of three compo¬ 
nents with equal mixing proportions. 
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XXX 
(a) (b) (c) 

Fig. 2 (a) P\: the one-component mixture after the first iteration (message length I — 22793 bits) (b) Red colour denotes the component being 
split. The dotted line is the direction of maximum variance. The black dots represent the initial means of the two-component sub-mixture (c) P 2 '. 
optimized mixture post-EM phase (/ = 22673 bits) results in an improvement. 




(a) 


(b) 


(c) 


Fig. 3 Second iteration: splitting the first component in P 2 (I = 22673 bits) (a) Initial means (shown by the black dots) (b) Optimized child mixture 
(denoted by red dashed lines) along with the second component of the parent (denoted by black dashes) (7 = 22691 bits) (c) P^ : stabilized mixture 
post-EM phase (7 = 22460 bits) results in a further improvement of message length. 


closest component (the one with the least KL-divergence) (see Fig. 5). Deletion and merging operations, in this case, do not 
result in an improvement. These two operations have different intermediate EM initializations (Figures 4(b) and 5(b)) but 
result in the same optimized one-component mixture. The same set of operations are performed on the second component 
in P 2 . In this particular case, splitting results in an improved mixture (same as P 3 ). P 3 is updated as the new parent and the 
series of split, delete, and merge operations are carried out on all components in P 3 . Fig. 6 shows these operations on the first 
component. We see that splitting the first component in P 3 results in P 4 (see Fig. 6 (c)). However, P 4 is not an improvement 
over P 3 as seen by the message lengths and is, therfore, discarded. Similarly, deletion and merging of the components do not 
yield improvements to P 3 . The operations are carried out on the remaining two components in P 3 (not shown in the figure) 
too. These perturbations do not produce improved mixtures in terms of the total message length. Since the third iteration 
does not result in any further improvement, the search terminates and the parent P 3 is considered to be the best mixture. 

In different stages of the search method, we have different intermediate mixtures. EM is a gradient descent technique 
and it can get trapped in a local optimum. By employing the suggested search, we are exhaustively considering the possible 
options, and aiming to reduce the possibility of the EM getting stuck in a local optimum. The proposed method infers a 
mixture by balancing the tradeoff due to model complexity and the fit to the data. This is particularly useful when there is 
no prior knowledge pertaining to the nature of the data. In such a case, this method provides an objective way to infer a 
mixture with suitable components that best explains the data through lossless compression. Another example is shown in 
Appendix 13.4, where the evolution of the inferred mixture is explored in the case of a mixture with overlapping components. 

Variation of the two-part message length: The search method infers three components and terminates. In order to demonstrate 
that the inferred number of components is the optimum number, we infer mixtures with increasing number of components 
(until M = 15 as an example) and plot their resultant message lengths. For each M > 3, the standard EM algorithm (Sec¬ 
tion 6 . 2 ) is employed to infer the mixture parameters. 
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(a) 


(b) 


(c) 


Fig. 4 Second iteration: deleting the first component in P 2 (a) Green ellipse denotes the component being deleted (b) EM initialization with the 
remaining component (/ = 25599 bits) (c) Resultant mixture after deletion and post EM (7 = 22793 bits) - no improvement. 




(a) 


(b) 


(c) 


Fig. 5 Second iteration: merging the two components in P 2 (a) Blue ellipses denote the components currently being merged (b) EM initialization 
with one merged component along with its parameters (c) Optimized mixture after merging (7 = 22793 bits) - no improvement 


Fig. 7 shows the total message lengths to which the 
EM algorithm converges for varying number of compo¬ 
nents M. As expected, the total message length (green 
curve) drastically decreases initially until M = 3 compo¬ 
nents are inferred. Starting from M = A, the total mes¬ 
sage length gradually increases, clearly suggesting that 
the inferred models are over-fitting the data with increas¬ 
ing statement cost to encode the additional parameters 
of these (more complex) models. We further elaborate 
on the reason for the initial decrease and subsequent in¬ 
crease in the total message length. As per MML eval¬ 
uation criterion, the message length comprises of two 
parts - statement cost for the parameters and the cost for 
stating the data using those parameters. The model com¬ 
plexity (which corresponds to the mixture parameters) in¬ 
creases with increasing M. Therefore, the first part of the 
message to encode parameters increases with an increase 
in the number of parameters. This behaviour is illustrated 
by the red curve in Fig. 7. The first part message lengths 
are shown in red on the right side Y-axis in the figure. As 
the mixture model becomes increasingly more complex, 

the error of fitting the data decreases. This corresponds to the second part of the message in the MML encoding framework. 
This behaviour is consistent with what is observed in Fig. 7 (blue curve). There is a sharp fall until M = 3; then onwards 
increasing the model complexity does not lower the error significantly. The error saturates and there is minimal gain with 



Fig. 7 Variation of the individual parts of the total message length with in¬ 
creasing number of components (note the two Y-axes have different scales 
- the first part follows the right side Y-axis; the second part and total mes¬ 
sage lengths follow the left side Y-axis) 
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(g) 


(h) Before optimizing (I = 22617 bits) 


(i) Post-EM (/ = 22610 bits) 


Fig. 6 Third iteration: Operations involving the first component (a)-(c) denote the splitting process, (d)-(f) denote the deletion process, and (g)-(i) 
shows the merging of the first component with its closest component. 


regards to encoding the data (the case of overhtting). However, the model complexity dominates after M = 3. The optimal 
balance is achieved when M = 3. In summary, the message length at M = 3 components was rightly observed to be the 
optimum for this example. We note that for a hxed number of mixture components, the EM algorithm for the MML metric is 
monotonically decreasing. However, while searching for the number of components, MML continues to decrease until some 
optimum is found and then steadily increases as illustrated through this example. 


9 Experiments with Gaussian mixtures 

We compare our proposed inference methodology against the widely cited method of Figueiredo and Jain (2002). The per¬ 
formance of their method is compared against that of Bayesian Information Criterion (BIC), Integrated Complete Likelihood 
(ICL), and approximate Bayesian (LEC) methods (discussed in Section 7). It was shown that the method of Figueiredo and 
Jain (2002) was far superior than BIC, ICL and LEC (using Gaussian mixtures). In the following sections, we demonstrate 
through a series of experiments that our proposed approach to infer mixtures fares better when compared to that of Figueiredo 
and Jain (2002). The experimental setup is as follows: we use a Gaussian mixture (true distribution), generate a random 
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sample from it, and infer the mixture using the data. This is repeated 50 times and we compare the performance of our 
method against that of Figueiredo and Jain (2002). As part of our analysis, we compare the number of inferred mixture 
components as well as the quality of mixtures. 


9.1 Methodolgies used to compare the mixtures inferred by our proposed approach and FJ’s method 


Comparing message lengths: The MML framework allows us to objectively compare mixture models by computing the 
total message length used to encode the data. The difference in message lengths gives the log-odds posterior ratio of any two 
mixtures (Equation (19)). Given some observed data, and any two mixtures, one can determine which of the two best explains 
the data. Our search methodology uses the scoring function (Imml) defined in Equation (41). As elaborated in Section 7.5, 
Eigueiredo and Jain (2002) use an approximated MML-like scoring function (Ipj) given by Equation (46). 

We employ our search method and the method of Figueiredo and Jain (2002) to infer the mixtures using the same data; 
let the inferred mixtures be „#* and respectively. We compute two quantities: 

^Imml = 

and AIpj = (59) 

We use the two different scoring functions to compute the differences in message lengths of the resulting mixtures 
and . Since the search method used to obtain optimizes the scoring function Imml, it is expected that Imml{,^*) < 
Imml{,^^^) and consequently AImml > 0. This implies that our method is performing better using our defined objective 
function. However, if this indicates that our inferred mixture results in a lower value of the 

scoring function that is defined by Figueiredo and Jain (2002). Such an evaluation not only demonstrates the superior perfor¬ 
mance of our search (leading to ,/#*) using our defined scoring function but also proves it is better using the scoring function 
as defined by Figueiredo and Jain (2002). 

Kullback Leibler (KL) divergence: In addition to using message length based evaluation criterion, we also compare the 
mixtures using KL-divergence (Kullback and Leibler, 1951). The metric gives a measure of the similarity between two 
distributions (the lower the value, the more similar the distributions). For a mixture probability distribution, there is no 
analytical form to compute the metric. However, one can calculate its empirical value (which asymptotically converges to 
the KL-divergence). In experiments relating to mixture simulations, we know the true mixture from which the data 
{xi}j 1 < ; < M is being sampled. The KL-divergence is given by the following expression: 






Pr(x,.^') 
® Pr(x,..#) 


1 y Pr(x;,..#') 
Pr(x,-,,^) 


where is a mixture distribution (.^* or whose closeness to the true mixture is to be determined. 


(60) 


9.2 Bivariate mixture simulation 

An experiment conducted by Figueiredo and Jain (2002) was to randomly generate N = 800 data points from a two- 
component (with equal mixing proportions) bivariate mixture ,/#' whose means are at jUj = (0,0)^ and fl 2 = (5,0)^, 
and equal covariance matrices: Ci = C 2 = I (the identity matrix), and compare the number of inferred components. We 
repeat the same experiment here and compare with the results of Figueiredo and Jain (2002). The separation 5 between the 
means is gradually increased and the percentage of the correct selections (over 50 simulations) as determined by the two 
search methods is plotted. Fig. 8(a) shows the results of this experiment. As the separation between the component means is 
increased, the number of correctly inferred components increases. We conducted another experiment where we fix the sep¬ 
aration between the two components and increase the amount of data being sampled from the mixture. Fig. 8(b) illustrates 
the results for a separation of 5 = 2.0. As expected, increasing the sample size results in an increase in the number of correct 
selections. Both the search methods eventually infer the true number of components at sample size > 3500. We note that in 
both these cases, the differences in message lengths AImml and AIpj are close to zero. The KL-divergences for the mixtures 
inferred by the two search methods are also the same. Therefore, for this experimental setup, the performance of both the 
methods is roughly similar. 

As the difference between the two search methods is not apparent from these experiments, we wanted to investigate the 
behaviour of the methods with smaller sample sizes. We repeated the experiment similar to that shown in Fig. 8(a) but with 
a sample size oiN = 100. Our search method results in a mean value close to 1 for different values of 8 (see Table 1). The 
mean value of the number of inferred components using the search method of Figueiredo and Jain (2002) fluctuates between 
2 and 3. However, there is significant variance in the number of inferred components (see Table 1). These results are also 
depicted through a boxplot (Fig. 9). There are many instances where the number of inferred components is more than 3. The 
results indicate that the search method (FJ) is overfitting the data. 
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Fig. 8 2-dimensional mixture (a) Percentage of correct selections with varying separation for a fixed sample size of N — 800 (b) Average number 
of inferred mixture components with different sample sizes and a fixed separation of 5 = 2.0 between component means. 
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Fig. 9 Box-whisker plot showing the variability in the number of in¬ 
ferred components (A^ = 100 and over 50 simulations). 


Table 1 The mean and variance of the number of inferred 
components for each 5 value (N — 100 and over 50 simula¬ 
tions). 


Separation 

5 

Proposed 

FJ 

Mean 

Variance 

Mean 

Variance 

1.8 

1.02 

0.020 

1.98 

2.673 

1.9 

1.00 

0.000 

2.26 

2.482 

2.0 

1.00 

0.000 

2.04 

2.325 

2.1 

1.02 

0.020 

2.20 

3.510 

2.2 

1.04 

0.039 

2.20 

2.285 

2.3 

1.06 

0.057 

2.44 

3.639 

2.4 

1.06 

0.098 

2.54 

3.967 

2.5 

1.04 

0.039 

2.98 

3.203 

2.6 

1.10 
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2.42 

2.942 
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Further, we evaluate the correctness of the mix¬ 
tures inferred by the two search methods by compar¬ 
isons using the message length formulations and KL- 
divergence.Fig. 10 shows the boxplot of the difference in 
message lengths of the mixtures inferred using our 
proposed search method and the mixtures inferred 
using that of Figueiredo and Jain (2002). AImml > 0 
across all values of 5 for the 50 simulations. As per Equa¬ 
tion (59), we have This im¬ 

plies that has a lower message length compared to 
when evaluated using our scoring function. Simi¬ 
larly, we have Alpj < 0, i.e., This 

implies that has a lower message length compared 
to ./#* when evaluated using FJ’s scoring function. These 
results are not surprising as and are obtained 
using the search methods which optimize the respective 
MML and MML-like scoring functions. 

We then analyzed the KL-divergence of and 
with respect to the true bivariate mixture over 
all 50 simulations and across all values of 8 . Ideally, the 

KL-divergence should be close to zero. Fig. 11(a) shows the KL-divergence of the mixtures inferred using the two search 


2.3 


Separation 5 


Fig. 10 Bivariate mixture {N — 100): difference in message lengths com¬ 
puted using the two different scoring functions (see Equation (59)). 
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methods with respect to Ji' when the separation is 5 = 2.0. The proposed search method infers mixtures whose KL- 
divergence (denoted by red lines) is close to zero, and more importantly less than the KL-divergence of mixtures inferred by 
the search method of Figueiredo and Jain (2002) (denoted by blue lines). The same type of behaviour is noticed with other 
values of 5. Fig. 11(b) compares the KL-divergence for varying values of 5. The median value of the KL-divergence due 
to the proposed search method is close to zero with not much variation. The search method of Figueiredo and Jain (2002) 
always result in KL-divergence higher than that of ours. The results suggest that, in this case, mixtures inferred by 
employing the search method of Figueiredo and Jain (2002) deviate significantly from the true mixture distribution . This 
can also be explained by the fact that there is a wide spectrum of the number of inferred components (Fig. 9). This suggests 
that the MML-like scoring function is failing to control the tradeoff between complexity and quality of fit, and hence, is 
selecting overly complex mixture models. 



Fig. 11 Comparison of infemed mixtures using KL-divergence (bivariate example with W = 100 and 50 simulations) (a) Particular case of 5 = 2.0 
(b) For all values of 5 £ {1.8,... ,2.6}. 


9.3 Simulation of 10-dimensional mixtures 

Along the lines of the previous experiment, Figueiredo and Jain (2002) conducted another experiment for a 10-variate two- 
component mixture with equal mixing proportions. The means are at /ij = (0,... ,0)^ and /I 2 = (5,..., 5)^, so that the 
Euclidean distance between them is 5\/l0. The covariances of the two components are Ci = C 2 = I (the identity matrix). 
Random samples of size A = 800 were generated from the mixture and the number of inferred components are plotted. The 
experiment is repeated for different values of S and over 50 simulations. Fig. 12(a) shows the number of inferred components 
using the two search methods. At lower values of S, the components are close to each other, and hence, it is relatively more 
difficult to correctly infer the true number of components. We observe that our proposed method performs clearly better 
than that of Figueiredo and Jain (2002) across all values of S. We also compared the quality of these inferred mixtures by 
calculating the difference in message lengths using the two scoring functions and the KL-divergence with respect to 
. For all values of S, AImml > 0, i.e., our inferred mixtures have a lower message length compared to when 

evaluated using our scoring function. More interestingly, we also note that Alfj > 0 (see Fig. 13(a)). This reflects that 
have a lower message length compared to when evaluated using the scoring function of Figueiredo and Jain (2002). 
This suggests that their search method results in a sub-optimal mixture and fails to infer the better . 

In addition to the message lengths, we analyze the mixtures using KL-divergence. Similar to the bivariate example in 
Fig. 11(a), the KL-divergence of our inferred mixtures is lower than , the mixtures inferred by Figueiredo and 
Jain (2002). Fig. 13(b) shows the boxplot of KL-divergence of the inferred mixtures and At higher values of 

S >= 1.45, the median value of KL-divergence is close to zero, as the number of correctly inferred components (Fig. 12(a)) 
is more than 90%. However, our method always infers mixtures ./#* with lower KL-divergence compared to . These 
experimental results demonstrate the superior performance of our proposed search method. 

Another experiment was carried out where the S = 1.20 was held constant (extremely close components), gradually 
increased the sample size N, and plotted the average number of inferred components by running 50 simulations for each 
N. Fig. 12(b) shows the results for the average number of inferred components as the amount of data increases. Our search 
method, on average, infers the true mixture when the sample size is ~ 1000. However, the search method of Figueiredo and 
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Fig. 12 10-dimensional mixture: (a) Percentage of correct selections with varying S for a fixed sample size of N = 800 (separation between the 
means is S\/l0) (b) Average number of inferred mixture components with different sample sizes and S = 1.20 between component means. 
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Fig. 13 Comparison of mixtures with respect to the 10-variate true mixture and N = 800 (a) Difference in message lengths computed using the two 
scoring functions (b) Box-whisker plot of KL-divergence 


Jain (2002) requires larger amounts of data; even with a sample size of 2000, the average number of inferred components is 
~ 1.9. In Fig. 12(b), the red curve reaches the true number of 2 and saturates more rapidly than the blue curve. 


9.4 The impact of weight updates as formulated by Figueiredo and Jain (2002) 

One of the drawbacks associated with the search method of Figueiredo and Jain (2002) is due to the form of the updating 
expression for the component weights (Equation (47)). As discussed in Section 7.5.2, a particular instance of wrong inference 
is bound to happen when the net membership of a (valid) component is less than Np jl, where Np is the number of free 
parameters per component. In such a case, the component weight is updated as zero, and it is eliminated, effectively reducing 
the mixture size by one. 

We conducted the following experiment to demonstrate this behaviour: we considered the two-component 10-variate 
mixture as before and randomly generate samples of size 50 from the mixture. Since the constituent components of 
./#' have equal weights, on average, each component has a membership of 25. We used 5 = {10,100,1000}, so that the 
two components are well apart from each other. For each 5, we run 50 simulations and analyze the number of inferred 
components. As expected, the search method of Figueiredo and Jain (2002) always infer a mixture with one component 
regardless of the separation 8. Our method always infers the correct number of components. In order to test the validity of 
mixtures inferred by our proposed method, we analyze the resultant mixtures by comparing the message lengths as discussed 
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in Section 9.1. Fig. 14(a) shows the difference in message lengths AImml given in Equation (59). We observe that AImml > 0 



Fig. 14 Evaluation of the quality of inferred mixtures by comparing the difference in message lengths as computed using the two scoring functions. 
Positive difference indicates that the mixtures inferred by our search method have lower message lengths (see Equation (59)). 


for all 5. This demonstrates that our search based mixtures have lower message lengths compared to mixtures 
using our scoring function. The same phenomenon is observed when using the MML-like scoring function of Figueiredo 
and Jain (2002). In Fig. 14(b), we observe that AIpj > 0, which means our search based mixtures have lower message 
lengths compared to mixtures when evaluated using their scoring function. 

This demonstrates that is a better mixture as com¬ 
pared to and their search method is unable to infer it. We 
also note that the differences in message lengths increases with 
increasing 5. This is because for the one-component inferred 
mixture , the second part of the message (Equation (46)) 
which corresponds to the negative log-likelihood term increases 
because of poorer fit to the data. The two modes in the data be¬ 
come increasingly pronounced as the separation between con¬ 
stituent components of the true mixture increases, and hence, 
modelling such a distribution using a one-component mixture 
results in a poorer fit. This is clearly an incorrect inference. 

We further strengthen our case by comparing the KL- 
divergence of the inferred mixtures and with re¬ 

spect to the true mixture. Fig. 15 illustrates the results. As 5 in¬ 
creases, the blue coloured plots shift higher. These correspond 
to mixtures inferred by Figueiredo and Jain (2002). Our 
search method, however, infers mixtures which have lower 
KL-divergence. The figure indicates that the inferred mixtures 
are more similar to the true distribution as compared to 
mixtures . 

These experiments demonstrate the ability of our search method to perform better than the widely used method of 
Figueiredo and Jain (2002). We compared the resulting mixtures using our proposed MML formulation and the MML-like 
formulation of Figueiredo and Jain (2002), showing the advantages of the former over the latter. We also used a neutral 
metric, KL-divergence, to establish the closeness of our inferred mixtures to the true distributions. We will now illustrate the 
behaviour of our search method on two real world datasets. 
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Fig. 15 Box-whisker plot of KL-divergence of mixtures infen'ed 
by the two search methods. A random sample of size = 50 is 
generated for each 5 and this is repeated 50 times. 


9.5 Analysis of the computational cost 

At any intermediate stage of the search procedure, a current mixture with M components requires M number of split, delete, 
and merge operations before it is updated. Each of the perturbations involve performing an EM to optimize the corresponding 
mixture parameters. To determine the convergence of EM, we used a threshold of 10“^ which was the same as used by 
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Figueiredo and Jain (2002). FJ’s method also requires to start from an initial large number of components. We used 25 as 
an initial number based on what was suggested in Figueiredo and Jain (2002). We investigate the number of times the EM 
routine is called and compare it with that of Figueiredo and Jain (2002). We examine with respect to the simulations that 
were carried out previously. For the bivariate mixture discussed in Section 9.2, the number of resulting EM iterations when 
the sample sizes are N = 800 and N = 100 are compared in Fig. 16(a), (b) respectively. 


S 100 


Proposed 


nt: 


Separation 8 

(a) Bivariate mixture {N = 800) 


2.1 2.2 2.3 2.4 2.5 2.6 


Proposed - 


2.1 2.2 2.3 2.4 2.5 2.6 


Separation 8 

(b) Bivariate mixture {N = 100) 



(c) 10-variate mixture {N = 800) 
Fig. 16 Number of EM iterations performed during the mixture simulations discussed in Sections 9.2 and 9.3. 


As per the discussion in Section 9.2, at A = 800, the average number of components inferred by the two methods are 
about the same (Fig. 8(a)). However, the number of EM iterations required by FJ’s method is greater than 200 across all 
values of 5 (Fig. 16(a)). In contrast, the proposed method, on average, requires fewer than 50 iterations. In this case, both 
methods produce a similar result with FJ’s method requiring more number of EM iterations. When the bivariate mixture 
simulation is carried out using N = 100, the number of EM iterations required by FJ’s method, on average, is greater than 
100, while the proposed method requires fewer than 40 iterations (Fig. 16(b)). In this case, the proposed method not only 
infers better mixtures (as discussed in Section 9.2) but is also conservative with respect to computational cost. 

For the simulation results corresponding to the 10-variate mixtures in Section 9.3, the proposed method requires close to 
50 iterations on average, while FJ’s method requires about 20 (Fig. 16(c)). However, the mixtures inferred by the proposed 
method fare better when compared to that of FJ (Figs. 12, 13). Furthermore, for the simulation results explained in Sec¬ 
tion 9.4, FJ’s method stops after 3 EM iterations. This is because their program does not accommodate components when the 
memberships are less than Np/2. The proposed method requires 18 EM iterations on average and infers the correct mixture 
components. In these two cases, our method infers better quality mixtures, with no significant overhead with regard to the 
computational cost. 


9.6 Acidity data set (Richardson and Green, 1997; McLachlan and Peel, 1997) 


The first example is the univariate acidity data set which contains 155 points. Our proposed search method infers a mixture 
with 2 components whereas the search method of Eigueiredo and Jain (2002) infers a mixture with 3 components. 
The inferred mixtures are shown in Eig. 17 and their corresponding parameter estimates are given in Table 2. In order to 
compare the mixtures inferred by the two search methods, we compute the message lengths of the inferred mixtures using 
our complete MML and the approximated MML-like scoring functions. 

When evaluated using our MML scoring function, our inferred mixture results in a gain of ~ 4 bits (see Table 3). Based 
on the MML framework, our two-component mixture is 2^ times more likely than the three-component mixture 
(as per Equation (19)). Eurthermore, when the inferred mixtures are evaluated as per the MML-like scoring function, 
is still considered better (~ 298 bits) than (~ 320 bits). Thus, using both forms of scoring function, is the better 
mixture model of this data set. 
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Fig. 17 Mixtures inferred by the two search methods using the acidity data set. See Table 2 for the con'esponding parameter estimates. 


Component 

index 

Weight 

Parameters 

1 

0.34 

6.39, 0.17 

2 

0.35 

4.21,0.05 

3 

0.31 

4.71,0.36 


Component 

index 

Weight 

Parameters 

1 

0.41 

6.24, 0.28 

2 

0.59 

4.33,0.14 


(a) Proposed 


Table 2 The pai'ameters of the infened mixtures shown in Fig. 17 



Inferred mixtures 

Scoring 

functions 

Proposed 

FJ 

MML 

MML-like 

1837.61 

298.68 

1841.69 

320.02 


Table 3 Message lengths (measured in bits) of the 
mixtures (in Fig. 17) as evaluated using the MML and 
MML-like scoring functions. 


9.7 Iris data set (Anderson, 1935; Fisher, 1936) 

The second example is the popular Iris data set. The data is 4-dimensional and comes from three Iris species namely, Iris- 
setosa, Iris-versicolor, and Iris-virginica. The data size is 150 with each class (species) comprising of 50 representative 
elements. Our search method infers a 4 component mixture and the search method of Figueiredo and Jain (2002) infers 
a 3 component mixture (see Fig. 18). Table 4 shows the memberships of the 150 elements in each of the components 
in the inferred mixtures. We notice an additional component M4 in ,/#* which has a net membership of 9.51, that is ~ 6% 
of the entire data set. It appears that the component M2 in (Table 4(b)) is split into two components M2 and M4 in 
„#* (Table 4(a)). The quality of the inferred mixtures is determined by comparing their message lengths using the MML 
and MML-like scoring functions. Table 5 shows the values obtained using the two formulations. When evaluated using our 
complete MML formulation, our inferred mixture results in extra compression of ~ 1 bit, which makes it twice as likely 
as - it is a closely competing model compared to ours. When evaluated using the MML-like scoring function, our 
inferred mixture still has a lower message length compared to . In both the cases, the mixture inferred by our 
search method is preferred. 



Inferred mixtures 

Scoring 

functions 

Proposed 

FJ 

(„#'"■') 

MML 

MML-like 

6373.01 

323.31 

6374.27 

342.57 


Species 

Ml 

M2 

M3 

M4 


Species 

Ml 

M2 

M3 

setosa 

50 

0 

0 

0 


setosa 

50 

0 

0 

versicolor 

0 

5.64 

44.36 

0 


versicolor 

0 

5.55 

44.45 

virginica 

0 

40.29 

0.20 

9.51 


virginica 

0 

49.78 

0.22 


(a) Data distribution using 4 components (b) Data distribution using 3 components 


Table 5 Message lengths (measured in bits) 
of the mixtures (in Fig. 18) as evaluated us- 

Table 4 Memberships of Iris data as using the inferred mixtures in Fig. 18 (a) Distribution of and MML-like scoring func- 

data using (b) Distribution of data using tions 
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Principal component 1 



-2 I-*-*-*-*-*-*-*-*-> 

-4 -3 -2 -1 0 1 2 3 4 5 

Principal component 1 


(a) Mixture inferred using our proposed search method. 


(b) Mixture infen'ed using FJ’s search method. 


Fig. 18 Mixtures inferred by the two search methods using the Iris data set. The data is projected onto the two principal components. 


10 Experiments with von Mises-Fisher distributions 

We compare our MML-based parameter inference with the current state of the art vMF estimators (discussed in Section 3). 
Tests include the analysis of the MML estimates of the concentration parameter: is the approximation of MML estimate 

using Newton’s method and Kmh is the approximation using Halley’s method (see Equations (33) and (34)) against the 
traditionally used approximations. Estimation of the vMF mean direction is the same across all these methods. Estimation of 
K, however, differs and hence, the corresponding results are presented. Through these experiments, we demonstrate that the 
MML estimates perform better than its competitors. These are followed by experiments demonstrating how these estimates 
aid in the inference of vME mixtures. These experiments illustrate the application of the proposed search method to infer 
vME mixtures using empirical studies and on real world datasets. 


10.1 MML-based parameter estimation for a vMF distribution 

Eor different values of dimensionality d and concentration parameter K, data of sample size N are randomly generated from 
a vMF distribution using the algorithm proposed by Wood (1994). The parameters of a vME distribution are estimated using 
the previously mentioned approximations. Let k = {kt,Kn,Kh, Kmn, ^mh} denote the estimate of K due to the respective 
methods. 

Errors in K estimation: We first report the errors in K estimation by calculating the absolute error \ k— k\ and the squared 
error {k— k)^ averaged over 1000 simulations. The relative error can be used to measure the percentage error in K 
estimation. The following observations are made based on the results shown in Table 6. 

- For N = \0,d = 10, JC = 10, the average relative error of Kj, Kf^, Kh is ~ 25%; for Kmn, ^mh, it is ~ 20%. When N 

is increased to 100, the average relative error of Kj is 5.09%, is 5.05%, and Kmn,^mh is 4.9%. We note that 

increasing N while holding d and K reduces the error rate across all estimation methods and for all tested combinations 
of d, K. This is expected because as more data becomes available, the inference becomes more accurate. The plots shown 
in Figure 19 reflect this behaviour. The mean error at lower values of At = 5,10,20,30 is noticeable. However, as N is 
increased to 1000, there is a drastic drop in the error. We note that this behaviour is consistent across all the different 
estimation methods. 

- For fixed N and d, increasing K increases the mean absolute error. However, the average relative error decreases. As an 

example, for A = 100, J = 100, JC = 10, the average relative error of (Cj, Kjv, Kh is ~ 42%; for Kmni ^mh, it is 36.7% and 
34% respectively. When K is increased to 100, the error rate for Kt, Kjv, Kh drops to 2.18% and for Kmn, ^mh, it drops to 
1.68%. Further increasing K by an order of magnitude to 1000 results in average relative errors of 1.4% for Kh 

and 1.1% for Kmn, ^mh- This indicates that as the data becomes more concentrated, the errors in parameter estimation 
decrease. 

- There does not appear to be a clear pattern of the variation in error rates when d is changed keeping N and K fixed. 
However, in any case, MML-based approximations have the least mean absolute and mean squared error. 

KL-divergence and message lengths of the estimates: The quality of parameter inference is further determined by computing 
the KL-divergence and the message lengths associated with the parameter estimates. The analytical expression to calculate 
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Mean absolute error 

Mean squared error 

{N,cLk) 

Tanabe 

Sra 

Song 

MML 

Tanabe 

Sra 

Song 

MML 


Kt 

Kn 

Kh 

Kmn 

Kmh 

Kt 

Kn 

Kh 

Kmn 

Kmh 

10,10,10 

2.501e-l-0 

2.4866-hO 

2.486e-H0 

2.008e-H0 

2.0126-hO 

L009e-Hl 

9.9846-1-0 

9.9846-hO 

5.811e-H0 

5.8506-1-0 

10,10,100 

1.879e-i-l 

1.8776-hI 

1.877e-Hl 

1.316e-Hl 

1.316e-Hl 

5.930e-H2 

5.9206-1-2 

5.9206-H2 

2.800e-H2 

2.8026-1-2 

10,10,1000 

1.838e-i-2 

1.8386-H2 

1.838e-H2 

1.289e-H2 

1.289e-H2 

5.688e-H4 

5.6876-1-4 

5.6876-H4 

2.721e-H4 

2.7246-1-4 

10,100,10 

2.716e-i-l 

2.7166-hI 

2.716e-Hl 

2.7086-1-1 

1.728e-Hl 

7.464e-H2 

7.4646-1-2 

7.4646-H2 

7.414e-H2 

4.102e-H2 

10,100,100 

2.0146-1-1 

2.0146-hI 

2.014e-Hl 

1.2746-1-1 

1.265e-Hl 

4.543e-H2 

4.5436-1-2 

4.5436-H2 

2.069e-H2 

2.049e-H2 

10,100,1000 

1.2156-1-2 

1.2156-H2 

1.215e-H2 

3.8736-1-1 

3.870e-Hl 

L760e-H4 

1.7606-1-4 

1.7606-H4 

2.338e-H3 

2.337e-H3 

10,1000,10 

3.4156-1-2 

3.4156-H2 

3.415e-H2 

3.4156-1-2 

1.386e-H2 

L167e-H5 

1.1676-1-5 

1.1676-H5 

L167e-H5 

2.220e-H4 

10,1000,100 

2.7026-1-2 

2J02S+2 

2J02e+2 

2.7026-1-2 

1.652e-H2 

7.309e-H4 

7.3096-1-4 

7.3096-H4 

7.309e-H4 

3.101e-H4 

10,1000,1000 

1.9916-1-2 

1.9916-H2 

1.991e-H2 

1.2326-1-2 

1.222e-H2 

4.014e-H4 

4.0146-1-4 

4.0146-H4 

L570e-H4 

l.S47e-H4 

100,10,10 

5.0926-1 

5.0476-1 

5.0476-1 

4.906e-l 

4.906e-l 

4.097e-l 

4.0226-1 

4.0226-1 

3.717e-l 

3.717e-l 

100,10,100 

3.9216-1-0 

3.9156-hO 

3.915e-H0 

3.813e-H0 

3.813e-H0 

2.457e-Hl 

2.4506-1-1 

2.4506-hI 

2.278e-Hl 

2.278e-Hl 

100,10,1000 

3.7486-1-1 

3.7476-hI 

3.747e-Hl 

3.669e-Hl 

3.669e-Hl 

2.320e-H3 

2.3196-1-3 

2.3196-H3 

2.174e-H3 

2.174e-H3 

100,100,10 

4.2236-1-0 

4.2236-hO 

4.223e-H0 

3.6746-1-0 

3.414e-H0 

L862e-Hl 

1.8626-1-1 

1.8626-hI 

1.403e-Hl 

1.4206-1-1 

100,100,100 

2.1876-1-0 

2.1866-hO 

2.186e-H0 

1.683e-H0 

1.683e-H0 

7.071e-H0 

7.0676-1-0 

7.0676-hO 

4.395e-H0 

4.395e-H0 

100,100,1000 

1.4476-1-1 

1.4476-hI 

1.447e-Hl 

1.129e-Hl 

1.129e-Hl 

3.226e-H2 

3.2266-1-2 

3.2266-H2 

2.027e-H2 

2.027e-H2 

100,1000,10 

9.1506-1-1 

9.1506-hI 

9.150e-Hl 

9.1466-1-1 

8.2Sle-Hl 

8.377e-H3 

8.3776-1-3 

8.3776-H3 

8.370e-H3 

6.970e-H3 

100,1000,100 

4.2996-1-1 

4.2996-hI 

4.299e-Hl 

4.8826-1-1 

4.080e-Hl 

L856e-H3 

1.8566-1-3 

1.8566-H3 

2.659e-H3 

1.738e-H3 

100,1000,1000 

1.8336-1-1 

1.8336-hI 

1.833e-Hl 

8.821e-H0 

8.821e-H0 

3.728e-H2 

3.7286-1-2 

3.7286-H2 

1.060e-H2 

1.060e-H2 


Table 6 Errors in K estimation. The averages are reported over 1000 simulations for each (N,d, k) triple. 
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Fig. 19 Box-whisker plots illustrating the K estimates as the sample size is gradually increased. True distribution is a 10-dimensional vMF with 
K = 10. The plots are also indicative of the bias due to the estimates. 


the KL-divergence of any two vMF distributions is derived in the Appendix. The KL-divergence is computed between the 
estimated parameters and the true vMF parameters. The minimum message length expression for encoding data using a vMF 
distribution is previously derived in Equation (31). Table 7 lists the average values of both the metrics. The MML estimates 
of K result in the least value of KL-divergence across all combinations of N, d, K. Also, the message lengths associated with 
the MML based estimates are the least. From Table 7, we notice that when N = 10, Kmn and Kmh clearly have lower message 
lengths. For N = 10,<7 = 10, JC = 10, Kmn, ^mh result in extra compression of ~ 1.5 bits over Kj, K5v, which makes the 
MML estimates 2* '^ times more likely than the others (as per Equation (19)). 

Bias of the parameter estimates: The maximum likelihood estimate of K is known to have significant bias (Schou, 1978; 
Best and Eisher, 1981; Cordeiro and Vasconcellos, 1999). Our goal here is to demonstrate that MML-based parameter 
approximations result in estimates with reduced bias. The mean squared error in Table 6 can be decomposed into the sum of 
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Average KL-divergence | 

1 Average message length 

(N,d, k) 

Tanabe 

Sra 

Song 

MML 1 

Tanabe 

Sra 

Song 

MML 


Kt 

Kn 

Kh 

Kmn 

Kmh 

Kt 

Kn 

Kh 

Kmn 

Kmh 

10,10,10 

8.777e-l 

8.7506-1 

8.7506-1 

6.428e-l 

6.4456-1 

9.2856-1-2 

9.2856-H2 

9.2856-1-2 

9.269e-H2 

9.269e-H2 

10,10,100 

8.803e-l 

8.7986-1 

8.7986-1 

7.196e-l 

7.1996-1 

8.2146-1-2 

8.2146-H2 

8.2146-1-2 

8.208e-H2 

8.208e-H2 

10,10,1000 

9.006e-l 

9.0056-1 

9.0056-1 

7.443e-l 

7.4466-1 

6.9256-1-2 

6.9256-H2 

6.9256-1-2 

6.919e-H2 

6.919e-H2 

10,100,10 

8.5176-1-0 

8.5176-hO 

8.517e-H0 

8.4796-1-0 

5.321e-H0 

8.6336-1-3 

8.6336-H3 

8.6336-1-3 

8.6336-1-3 

8.58Se-H3 

10,100,100 

8.4446-1-0 

8.4446-hO 

8.444e-H0 

6.007e+0 

6.0096-hO 

8.4286-1-3 

8.4286-H3 

8.4286-1-3 

8.414e-H3 

8.414e-H3 

10,100,1000 

8.4726-1-0 

8.4726-hO 

8.472e-H0 

7.118e+0 

7.1206-hO 

7.2746-1-3 

7.2746-H3 

7.2746-1-3 

7.269e-H3 

7.269e-H3 

10,1000,10 

8.4336-1-1 

8.4336-hI 

8.433e-Hl 

8.4336-1-1 

1.777e-Hl 

7.0306-1-4 

7.0306-H4 

7.0306-1-4 

7.0306-1-4 

6.925e-H4 

10,1000,100 

8.4306-1-1 

8.4306-hI 

8.430e-Hl 

8.4306-1-1 

4.697e-Hl 

7.0306-1-4 

7.0306-H4 

7.0306-1-4 

7.0306-1-4 

6.989e-H4 

10,1000,1000 

8.4516-1-1 

8.4516-hI 

8.451e-Hl 

S.976e+1 

5.9776-hI 

6.8256-1-4 

6.8256-H4 

6.8256-1-4 

6.811e-H4 

6.811e-H4 

100,10,10 

7.4096-2 

7.3856-2 

7.3856-2 

7.173e-2 

7.173e-2 

9.11Se-H3 

9.115e-H3 

9.115e-H3 

9.115e-H3 

9.11Se-H3 

100,10,100 

7.5396-2 

7.5356-2 

7.5356-2 

7.411e-2 

7.411e-2 

7.858e-H3 

7.858e-H3 

7.858e-H3 

7.858e-H3 

7.858e-H3 

100 10,1000 

7.2716-2 

7.2716-2 

7.2716-2 

7.161e-2 

7.161e-2 

6.403e-H3 

6.403e-H3 

6.403e-H3 

6.403e-H3 

6.403e-H3 

100,100,10 

7.2706-1 

7.2706-1 

7.2706-1 

6.146e-l 

6.2086-1 

8.6156-1-4 

8.6156-H4 

8.6156-1-4 

8.614e-H4 

8.614e-H4 

100,100,100 

7.3576-1 

7.3576-1 

7.3576-1 

7.117e-l 

7.117e-l 

8.299e-H4 

8.299e-H4 

8.299e-H4 

8.299e-H4 

8.299e-H4 

100,100,1000 

7.3306-1 

7.3306-1 

7.3306-1 

7.210e-l 

7.210e-l 

6.976e-H4 

6.976e-H4 

6.976e-H4 

6.976e-H4 

6.976e-H4 

100,1000,10 

7.3246-1-0 

7.3246-hO 

7.324e-H0 

7.3186-1-0 

6.201e-H0 

7.0246-1-5 

7.0246-H5 

7.0246-1-5 

7.0246-1-5 

7.023e-H5 

100,1000,100 

7.3026-1-0 

7.3026-hO 

7.302e-H0 

7.045e+0 

7.1066-hO 

7.0226-1-5 

7.0226-H5 

7.0226-1-5 

7.019e-H5 

7.022e-H5 

100,1000,1000 

7.3406-1-0 

7.3406-hO 

7.340e-HO 

7.097e+0 

7.097e-H0 

6.707e-H5 

6.707e-H5 

6.707e-H5 

6.707e-H5 

6.707e-H5 


Table 7 Comparison of the K estimates using KL-divergence and message length formulation (both metrics are measured in bits). 


bias and variance terms as shown below (Taboga, 2012). 

mean squared error = E[(ic— k)^] = (E[jc] — (c)^ +E[(jc — Elic])^] 

Bias^(/i:) Variance (ic) 

where E[.] denotes the expectation of the related quantity. Table 8 shows the bias-variance of the estimated concentration 
parameter ic in the above simulations. The bias of Kmn and Kmh is lower compared to the other estimates. The variance of 
the MML estimates, however, is not always the least, as observed in Table 8. The combination of bias and variance, which is 
the mean squared error, is empirically demonstrated to be the least for the MML estimates. 



Bias (squared) | 

1 Variance 

(N,d, k) 

Tanabe 

Sra 

Song 

MML 1 

Tanabe 

Sra 

Song 

MML 


Kt 

Kn 

Kh 

Kmn 

Kmh 

Kt 

Kn 

Kh 

Kmn 

Kmh 

10,10,10 

5.6096-1-0 

5.5206-hO 

5.520e-H0 

1.2996-1-0 

1.269e-H0 

4.4766-1-0 

4.464e-H0 

4.464e+0 

4.5126-1-0 

4.581e-H0 

10,10,100 

2.2986-1-2 

2.2886-H2 

2.288e-H2 

4.9866-3 

2.577e-4 

3.6326-1-2 

3.6326-H2 

3.6326-1-2 

2.800e-H2 

2.802e-H2 

10,10,1000 

2.1576-1-4 

2.1566-H4 

2.156e-H4 

2.764e+l 

3.1936-hI 

3.5316-1-4 

3.5316-H4 

3.5316-1-4 

2.718e-H4 

2.720e-H4 

10,100,10 

7.3786-1-2 

7.3786-H2 

7.378e-H2 

7.3336-1-2 

2.875e-H2 

8.6606-1-0 

8.6606-hO 

8.6606-1-0 

8.066e-H0 

1.226e-H2 

10,100,100 

4.0546-1-2 

4.0536-H2 

4.053e-H2 

1.5466-1-2 

1.522e-H2 

4.894e-Hl 

4.894e-Hl 

4.894e-Hl 

5.2316-1-1 

5.273e-Hl 

10,100,1000 

1.4736-1-4 

1.4736-H4 

1.473e-H4 

2.2076-1-1 

1.994e-Hl 

2.8706-1-3 

2.8706-H3 

2.8706-1-3 

2.316e-H3 

2.317e-H3 

10,1000,10 

1.1666-1-5 

1.1666-H5 

1.166e-H5 

1.1666-1-5 

1.921e-H4 

8.090e-Hl 

8.090e-Hl 

8.090e-Hl 

8.090e-Hl 

2.983e-H3 

10,1000,100 

7.3016-1-4 

7.3016-H4 

7.301e-H4 

7.3006-1-4 

2.728e-H4 

8.6856-1-1 

8.6856-hI 

8.6856-1-1 

8.635e-Hl 

3.735e-H3 

10,1000,1000 

3.9646-1-4 

3.9646-H4 

3.964e-H4 

1.5176-1-4 

1.493e-H4 

4.969e-H2 

4.969e-H2 

4.969e-H2 

5.3066-1-2 

5.342e-H2 

100,10,10 

4.1296-2 

3.5286-2 

3.5286-2 

8.1326-3 

8.129e-3 

3.684e-l 

3.6696-1 

3.669e-l 

3.636e-l 

3.636e-l 

100,10,100 

1.2806-1-0 

1.2066-hO 

1.206e-H0 

5.5056-2 

5.504e-2 

2.3296-1-1 

2.3296-Hl 

2.3296-1-1 

2.273e-Hl 

2.273e-Hl 

100,10,1000 

9.7966-1-1 

9.7286-hI 

9.728e-Hl 

6.6206-1-0 

6.619e-H0 

2.2226-1-3 

2.2226-H3 

2.2226-1-3 

2.1686-1-3 

2.168e-H3 

100,100,10 

1.7836-1-1 

1.7836-hI 

1.783e-Hl 

4.661e+0 

6.2026-hO 

7.807e-l 

7.807e-l 

7.807e-l 

9.3696-1-0 

8.003e-HO 

100,100,100 

3.3716-1-0 

3.3676-hO 

3.367e-H0 

7.1476-1 

7.146e-l 

3.7006-1-0 

3.7006-hO 

3.7006-1-0 

3.681e-H0 

3.681e-H0 

100,100,1000 

1.1616-1-2 

1.1616-H2 

1.161e-H2 

3.504e-l 

3.504e-l 

2.0656-1-2 

2.0656-H2 

2.0656-1-2 

2.023e-H2 

2.023e-H2 

100,1000,10 

8.3726-1-3 

8.3726-H3 

8.372e-H3 

8.3646-1-3 

6.809e-H3 

5.3856-1-0 

5.3856-hO 

5.3856-1-0 

5.200e-H0 

1.614e-H2 

100,1000,100 

1.8486-1-3 

1.8486-H3 

1.848e-H3 

S.143e+2 

1.6286-H3 

7.656e-H0 

7.656e-H0 

7.656e-H0 

2.1456-1-3 

1.099e-H2 

100,1000,1000 

3.3596-1-2 

3.3596-H2 

3.359e-H2 

6.9266-1-1 

6.925e-Hl 

3.6926-1-1 

3.6926-hI 

3.6926-1-1 

3.674e-Hl 

3.674e-Hl 


Table 8 Bias-variance decomposition of the squared error (k — k)^. 


Statistical hypothesis testing: There have been several goodness-of-fit methods proposed in the literature to test the null 
hypothesis of a vMF distribution against some alternative hypothesis (Kent, 1982; Mardia et ah, 1984; Mardia and Jupp, 
2000). Recently, Figueiredo (2012) suggested tests for the specific case of concentrated vMF distributions. Here, we examine 
the behaviour of K estimates for generic vMF distributions as proposed by Mardia et al. (1984). They derived a likelihood 
ratio test for the null hypothesis of a vMF distribution (Hq) against the alternative of a Fisher-Bingham distribution {Ha). The 
asymptotically equivalent Rao’s score statistic (Rao, 1973) was used to test the hypothesis. 
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The score statistic W, in this case, is a function of the concentration parameter. It has an asymptotic X^{p) distribution 
(with degrees of freedom p = ^d{d + 1) — 1) under Hq as the sample size N ^ For d = {10,100,1000}, the critical 
values at 5% significance level are given in Table 9. If the computed test statistic exceeds the critical value, then the null 
hypothesis of a vMF distribution is rejected. We conduct a simulation study where we generate random samples of size = 1 
million from a vMF distribution with known mean and (C = {10,100,1000}. For each inferred estimate k, we compute the test 
statistic and compare it with the corresponding critical value. The results are shown in Table 9. For d=\Q, the approximation 
Kj has a significant effect as its test statistic exceeds the critical value and consequently the p-value is close to zero. This 
implies that the null hypothesis of a vMF distribution is rejected by using the estimate Kj. However, this is incorrect as the 
data was generated from a vMF distribution. The p-values due to the estimates Kn,Kh, Kmn, ^mh are all greater than 0.05 (the 
significance level) which implies that the null hypothesis is accepted. For d = (100,1000}, the p-values corresponding to the 
different estimates are greater than 0.05. In these cases, the use of all the estimates lead to the same conclusion of accepting 
the null hypothesis of a vMF distribution. As the amount of data increases, the error due to all the estimates decreases. This 
is further exemplified below. 



Critical 

Test statistic | 

1 p-value of the test | 

{d,K) 

value 

Tanabe 

Sra 

Song 

MML 1 

Tanabe 

Sra 

Song 

MML 1 


X^{p) 

Kt 

Kn 

Kh 

Kmn 

Kmh 

Kt 

Kn 

Kh 

Kmn 

Kmh 

10,10 

7.215e+l 

1.850e+2 

5.353e+l 

5.353e+l 

5.353e+l 

5.353e+l 

O.OOOe+O 

5.258e-l 

5.258e-l 

5.260e-l 

5.260e-l 

10,100 

7.215e+l 

1.698e+3 

4.949e+l 

4.949e+l 

4.945e+l 

4.945e+l 

O.OOOe+0 

6.247e-l 

6.247e-l 

6.267e-l 

6.267e-l 

10,1000 

7.215e+l 

1.950e+3 

4.811e+l 

4.81 le+1 

5.060e+l 

5.060e+l 

O.OOOe+0 

6.571e-l 

6.571e-l 

5.724e-l 

5.724e-l 

100,10 

5.215e+3 

5.090e+3 

5.090e+3 

5.090e+3 

5.090e+3 

5.090e+3 

3.739e-l 

3.739e-l 

3.739e-l 

3.741e-l 

3.741e-l 

100,100 

5.215e+3 

5.010e+3 

5.010e+3 

5.010e+3 

5.010e+3 

5.010e+3 

6.103e-l 

6.127e-l 

6.127e-l 

6.125e-l 

6.125e-l 

100,1000 

5.215e+3 

5.025e+3 

5.018e+3 

5.018e+3 

5.022e+3 

5.022e+3 

5.427e-l 

5.597e-l 

5.597e-l 

5.517e-l 

5.517e-l 

1000,10 

5.021e+5 

5.006e+5 

5.006e+5 

5.006e+5 

5.006e+5 

5.006e+5 

4.682e-l 

4.682e-l 

4.682e-l 

4.687e-l 

4.687e-l 

1000,100 

5.021e+5 

5.005e+5 

5.005e+5 

5.005e+5 

5.005e+5 

5.005e+5 

5.050e-l 

5.050e-l 

5.050e-l 

5.057e-l 

5.057e-l 

1000,1000 

5.021e+5 

5.007e+5 

5.007e+5 

5.007e+5 

5.007e+5 

5.007e+5 

4.283e-l 

4.283e-l 

4.283e-l 

4.196e-l 

4.196e-l 


Table 9 Goodness-of-fit tests for the null hypothesis Hq : vMF distribution and the alternate hypothesis Ha : Fisher-Bingham distribution. Critical 
values of the test statistic correspond to a significance of 5%. 


Asymptotic behaviour of MML estimates: Based on the empirical tests, we have so far seen that MML estimates fare better 
when compared to the other approximations. We now discuss the behaviour of the MML estimates in the limiting case. For 
large sample sizes {N —i- oo), we plot the errors in K estimation. Song et al. (2012) demonstrated that their approximation Kh 
results in the lowest error in the limiting case. We compute the variation in error in two scenarios with fixed d = 1000 and: 

1. increasing K: Fig. 20(a) illustrates the behaviour of the absolute error with increasing K. The first observation is that 
irrespective of the estimation procedure, the error continues to increase with increasing K values (which corroborates our 
observations in the empirical tests) and then saturates. According to Song et al. (2012), their estimate Kh produces the 
lowest error which we can see in the figure. Further, our MML Newton approximation Kmn actually performs worse than 
Song’s approximation Kh- However, we note that the errors due to MML Halley’s approximation Kmh are identical to 
those produced by Kh- This suggests that asymptotically, the approximations achieved by Kh and Kmh are more accurate 
(note that the errors in the limiting case are extremely low). 

2. increasing R: The maximum likelihood estimate of K aims to achieve F(k) = A^{k) — R = 0 (Equation 10). Hence, 
log \A^{k) — R\ gives a measure of the error corresponding to an estimate of K. Figure 20(b) depicts the variation of this 
error with increasing R . We observe that Kh and Kmh produce the least error. We also note that the error produced due 
to Kmn is greater than that produced by Kmh- However, we highlight the fact that MML-based parameter inference aims 
to achieve G{k) = 0 (Equation 32), a fundamentally different objective function compared to the maximum likelihood 
based one. 

The asymptotic results are shown here by assuming a value of N = lO^**® (note the corresponding extremely low error 
rates). In the limiting case, the MML estimate Kmh coincides with the ML estimate Kh- However, Kh’s performance is 
better compared to the MML Newton’s approximation Kmn- The same behaviour is observed for when K is fixed and the 
dimensionality is increased. For enormous amount of data, the ML approximations converge to the MML ones. 


10.2 MML-based inference of mixtures of vMF distributions 

An empirical study is carried out where the proposed search method is employed to infer a mixture using data sampled from a 
known mixture distribution. The amount of data is gradually increased; for each sample size N, the simulation is repeated 50 
times and the number of inferred components is plotted (we used MML Halley’s approximation in all the discussed results). 
The various case studies are discussed below. 
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(a) Variation in error with increasing K (b) Variation in error with increasing R 

Fig. 20 Errors in K estimation for d = 1000 as the sample size N 


1. The true components in the mixture have different mean directions (separated by an angle 6). 

2. The true components in the mixture have the same mean direction but different concentration parameters. 


Case 1: The true mixture has two components with equal mixing proportions. We consider the case when d = 3. The mean 
of one of the vMF components is aligned with the Z-axis. The mean of the other component is chosen such that the angle 
between the two means is 0°. Fig. 21(a) illustrates the scenario when the concentration parameters of the constituent com¬ 
ponents are different. Fig. 21(b) shows the variation in the number of inferred components when the true vMF components 
have the same concentration parameter. In both scenarios, as the angular separation is increased, the components become 
more distinguishable and hence, less amount of data is required to correctly identify them. 

When the concentration parameters of the constituent components are different, the inference of the mixture is relatively 
easier compared to the case when the concentration parameters are same. In Fig. 21(a), for all angular separations, the true 
number of components is correctly inferred at a sample size of = 200. When 9 = 20°, the search method converges faster 
at ~ 100 as compared to 0 = 5°, when the convergence is at A^ ~ 180. In Fig. 21(b), when 0 = 5°, the search method 
infers only one-component as the true mixture components are hardly distinguishable. When 0 = 10°, even at A^ ~ 1000, 
the average number of inferred components is ~ 1.8. When 0 = 15°, the search method converges at A^ ~ 300 as compared 
to A^ ~ 120 in Fig. 21(a). Clearly, when the component means are different, it is easier to correctly infer mixtures whose 
components have different concentration parameters. 




Fig. 21 Case 1: Average number of components inferred for the two-component mixture whose means are separated by 6°. 
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Case 2: In this case, multivariate (J = {2,3,10}) vMF mixtures with equal mixing proportions and same component means 
are considered. The simulation results of true mixtures with two and three components are presented here. 

Fig. 22(a) shows the average number of components inferred for a two-component mixture whose concentration param¬ 
eters are (Ci = 10 and = 100. For each value of d, as the sample size increases, the average number of inferred components 
gradually increases until it saturates and reaches the true value (2 in this case). Increasing the sample size beyond this does 
not impact the number of inferred mixture components. The results for a 3-component mixture with identical means and 
concentration parameters Ki = 10, K 2 = 100, and K '3 = 1000 are shown in Fig. 22(b). As expected, the average number of 
inferred components increases in light of greater evidence. However, we note that it requires greater amount of data to cor¬ 
rectly infer the three mixture components as compared to the two-component case. In the two-component case (Fig. 22(a)), 
at around N = 450, all the three curves converge to the right number of components. For the three-component mixture in 
Fig. 22(b), up until N = 500, there is no convergence for d = 2,3. For d = 10, the average number of inferred components 
converges quickly (at N ~ 25) for the 2-component mixture when compared with N ~ 100 for the 3-component mixture. 
This is expected as correctly distinguishing three components (with same means) requires far more data. 




Fig. 22 Case 2: Average number of components inferred when the true mixture has components with the same mean directions but different 
concentration parameters, (a) 2-component mixture (Ki = 10, K 2 = 100) (b) 3-component mixture (ifi = 10, K 2 = 100, K '3 = 1000) 


It IS also interesting to note that for a = 2 m Fig. 22(b), the average number of inferred components saturates at 2, while 
the actual number of mixture components is 3. However, as the amount of available data increases, the (almost) horizontal 
line shows signs of gradual increase in its slope. Fig. 23 shows the increase in the average number for t/ = 2,3 as the data is 
increased beyond N = 500. The blue curve representing d = 3 stabilizes at IV ~ 2500. However, the red curve (d = 2) slowly 
starts to move up but the average is still well below the true number. This demonstrates the relative difficulty in estimating 
the true mixture when the means coincide especially at lower dimensions. 


One can imagine that when the means coincide, it would 
require greater amount of data to accurately infer the true mix¬ 
ture components. However, the amount of data required also 
depends on the dimensionality in consideration. It appears that 
as the dimensionality increases, we need smaller amounts of 
data (as can be seen for the t/ = 10 case). In (i-dimensional 
space, each datum comprises of d real values with the con¬ 
straint that it should lie on the unit hypersphere. So the esti¬ 
mation of the mean direction requires the estimation of (t/ — 1) 
values and one value for K. When there is N data, we actu¬ 
ally have Hd = N X {d — \) values available for estimating the 
d free parameters. For instance, given a sample of size N, for 
d = 2,n2=N and for d = 10, «io = 9N. We conjecture that this 
could be a possible reason for faster convergence in high di¬ 
mensional space. Through these experiments, we have demon¬ 
strated the ability of our search method to infer appropriate 
mixtures in situations with varying difficulty levels. 



Fig. 23 Gradual increase in the average inferred components for 
the 3-component mixture in Fig. 22(b). 
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11 Applications of vMF mixtures 

11.1 Application to text clustering 


The use of vMF mixtures in modelling high dimensional text 

data has been investigated by Banerjee et al. (2005). To compute the similarity between text documents requires their rep¬ 
resentation in some vector form. The elements of the vectors are typically a function of the word and document frequencies 
in a given collection. These vector representations are commonly used in clustering textual data with cosine based similarity 
metrics being central to such analyses (Strehl et al., 2000). There is a strong argument for transforming the vectors into 
points on a unit hypersphere (Salton and McGill, 1986; Salton and Buckley, 1988). Such a normalized representation of text 
data (which compensates for different document lengths) motivates their modelling using vMF mixture distributions. 

Banerjee et al. (2005) used their proposed approximation (Equation (12)) to estimate the parameters of a mixture with 
known number of components. They did not, however, propose a method to search for the optimal number of mixture 
components. We not only derived MML estimates which fare better compared to the previous approximations but also 
apply them to devise a search method to infer the optimal mixtures. Ideally, the search is continued until there is no further 
improvement in the message length (Algorithm 1). For practial purposes, the search is terminated when the improvement due 
to the intermediate split, delete and merge operations during the search process is less than 0.01%. Our proposed method to 
infer mixtures was employed on the datasets that were used in the analysis by Banerjee et al. (2005). The parameters of the 
intermediate mixtures are estimated using the MML Halley’s estimates (Equation (34)) for the component vMF distributions. 
Banerjee et al. (2005) use mutual information (MI) to assess the quality of clustering. For given cluster assignments X and 

Pr(A,7) 


the (known) class labels 7, MI is defined as: E 


log; 


. Along with the message lengths, we use MI as one other 


’Pr(A)Pr(7)_ 

evaluation criterion in our analysis. We also compute the average F-measure when the number of clusters is same as the 
number of actual classes. 

For each of the datasets, in the preprocessing step, we generate feature vectors using the most frequently occuring words 
and generating a TF-IDF score for each feature (word) based on Okapi BM25 score (Robertson and Zaragoza, 2009). These 
feature vectors are then normalized to generate unit vectors in some d-dimensional space. Using this as directional data on a 
hypersphere, a suitable mixture model was inferred using the greedy search proposed in Section 8. 


11.1.1 Classics dataset 

The dataset^ contains documents from three distinct categories: 1398 Cranfield (aeronautical related), 1033 Medline (medical 
journals) and 1460 Cisi (information retrieval related) documents. The processed data has d = 4358 features. 

Optimal number of clusters: In this example, it is known that there are three distinct categories. However, this information 
is not usually known in real world setting (and we do not know if they are from three vMF distributions). Assuming no 
knowledge of the nature of the data, the search method infers a mixture with 16 components. The corresponding assignments 
are shown in Table 10. A closer look at the generated assignments illustrate that each category of documents is represented by 
more than one component. The three categories are split to possibly represent specialized sub-categories. The Cisi category 
is distributed among 6 main components (M4 - M9). The Cranfield documents are distributed among M6, MIO - M15 
components and the Medline category is split into MO - M3, and M6 components. We observe that all but three components 
are non-overlapping; only M6 has representative documents from all three categories. 



MO 

Ml 

M2 

M3 

M4 

M5 

M6 

M7 

M8 

M9 

MIO 

Mil 

M12 

M13 

M14 

M15 

cisi 

0 

0 

4 

0 

288 

133 

28 

555 

197 

255 

0 

0 

0 

0 

0 

0 

cran 

0 

0 

0 

0 

2 

0 

362 

1 

0 

0 

58 

144 

135 

175 

223 

298 

med 

9 

249 

376 

138 

2 

0 

9 

0 

0 

0 

0 

0 

0 

0 

0 

0 


Table 10 Confusion matrix for 16-component assignment (MML Halley). 


The 16-component mixture inferred by the search method is a finer segregation of the data when compared to modelling 
using a 3-component mixture. The parameters of the 3-component mixture are estimated using an EM algorithm (Sec¬ 
tion 6.2). Table 11 shows the confusion matrices obtained for the cluster assignments using the various estimation methods. 
We see that all the estimates perform comparably with each other; there is not much difference in the assignments of data to 
the individual mixture components. 

The collection is comprised of documents that belong to dissimilar categories and hence, the clusters obtained are wide 
apart. This can be seen from the extremely high F-measure scores (Table 12). For the 3-component mixture, all the five 
different estimates result in high F-measure values with Song being the best with an average F-measure of 0.978 and a MI 

http://www.dataminlngresearch.com/index.php/2010/09/classics-classic4-datasets/ 
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cisi 

cran 

med 

cisi 

1449 

0 

11 

cran 

24 

1331 

43 

med 

13 

0 

1020 



cisi 

cran 

med 

cisi 

1450 

0 

10 

cran 

24 

1331 

43 

med 

13 

0 

1020 



cisi 

cran 

med 

cisi 

1450 

0 

10 

cran 

24 

1339 

35 

med 

14 

0 

1019 



cisi 

cran 

med 

cisi 

1441 

0 

19 

cran 

22 

1293 

83 

med 

8 

0 

1025 


(a) Banerjee 


(b) Tanabe 


(c) Song 


(d) MML (Halley) 


Table 11 Confusion matrices for 3-cluster assignment. (Sra’s confusion matrix is omitted as it is same as that of Tanabe) 


of 0.982. MML (Halley’s) estimate are close with an average F-measure of 0.976 and a MI of 0.976. However, based on the 
message length criterion, the MML estimate results in the least message length 190 bits less than Song’s). The mutual 
information score using MML estimate is 1.04 (for 16 components) compared to 0.976 for 3 components. Also, the message 
length is lower for the 16 component case. However, Song’s estimator results in a MI score of 1.043, very close to the score 
of 1.040 obtained using MML estimates. 


Number of clusters 

Evaluation metric 

Banerjee 

Tanabe 

Sra 

Song 

MML (Halley) 

3 

Message length 
Avg. F-measure 
Mutual Information 

100678069 

0.9644 

0.944 

100677085 

0.9758 

0.975 

100677087 

0.9758 

0.975 

100677080 

0.9780 

0.982 

100676891 

0.9761 

0.976 

16 

Message length 
Mutual Information 

100458153 

1.029 

100452893 

1.036 

100439983 

0.978 

100444649 

1.043 

100427178 

1.040 


Table 12 Clustering performance on Classic3 dataset. 


For the Classic3 dataset, Banerjee et al. (2005) analyzed mixtures with greater numbers of components than the “natural” 
number of clusters. They report that a 3-component mixture is not necessarily a good model and more number of clusters 
may be preferred for this example. As part of their observations, they suggest to “generate greater number of clusters and 
combine them appropriately”. However, this is subjective and requires some background information about the likely number 
of clusters. Our search method in conjunction with the inference framework is able to resolve this dilemma and determine 
the optimal number of mixture components in a completely unsupervised setting. 

11.1.2 CMU-Newsgroup 

The dataset^ contains documents from 20 different news categories each containing 1000 documents. Preprocessing of 
the data, as discussed above, resulted in feature vectors of dimensionality d = 6448. The data is first modelled using a 
mixture containing 20 components. The evaluation metrics are shown in Table 13. The average F-measure is 0.509 for MML- 
based estimation, slightly better than Banerjee’s score of 0.502. The low F-measure values are indicative of the difficulty in 
accurately distinguishing the news categories. The mutual information score for MML case is 1.379 which is lower than that 
of Sra’s. However, the total message length is lower for MML mixture compared to that of others. 


Number of clusters 

Evaluation metric 

Banerjee 

Tanabe 

Sra 

Song 

MML (Halley) 

20 

Message length 
Avg. F-measure 
Mutual Information 

728666702 

0.502 

1.391 

728545471 

0.470 

1.383 

728585441 

0.487 

1.417 

728536451 

0.435 

1.244 

728523254 

0.509 

1.379 

21 

Message length 
Mutual Information 

728497453 

1.313 

728498076 

1.229 

728432625 

1.396 

728374429 

1.377 

728273820 

1.375 


Table 13 Clustering performance on CMU_Newsgroup dataset. 


Optimal number of clusters: The proposed search method when applied to this dataset infers a mixture with 21 components. 
This is close to the “true” number of 20 (although there is no strong reason to believe that each category corresponds to a 
vMF component). The mutual information for the 21-cluster assignment is highest for Sra’s mixture with a score of 1.396 
and for MML mixture, it is 1.375 (Table 13). However, the total message length is the least for the MML-based mixture. 

The analysis of vMF mixtures by Banerjee et al. (2005) for both the datasets considered here shows a continued increase 
in the MI scores even beyond the true number of clusters. As such, using the MI evaluation metric for different number of 
mixture components does not aid in the inference of an optimal mixture model. Our search method balances the tradeoff 

http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups 
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between using a certain mixture and its ability to explain the observed data, and thus, objectively aids in inferring mixtures 
to model the normalized vector representations of a given collection of text documents. 

A mixture modelling problem of this kind where there is some information available regarding the nature of the data can 
be studied by altering the proposed search method. We provide some alternate strategies where the mixture modelling can 
be done in a semi-supervised setting. 

- The priors on the number of components and their parameters can be modelled based on the background knowledge. 

- If the true number of clusters are known, only splits may be carried out until we near the true number (each split being 
the best one given the current mixture). As the mixture size approaches the true number, all the three operations (split, 
delete, and merge) can be resumed until convergence. This increases the chance that the inferred mixture would have 
about the same number of components as the true model. 

- Another variant could be to start from a number close to the true number and prefer delete/merge operations over the 
splits. We cannot ignore splits completely because a component after splitting may be merged at a later iteration if there 
would be an improvement to the message length. 

- Another strategy could be to employ the EM algorithm and infer a mixture with the true number of components. This 
mixture can then be perturbed using split, delete, and merge operations until convergence. 


11.2 Mixture modelling of protein coordinate data 


The following application concerns the vMF mixture modelling of directional data arising from the orientation of main chain 
carbon atoms in protein structures. The structures that proteins adopt are largely dictated by the interactions between the con¬ 
stituent atoms. These chemical interactions impose constraints on the orientation of atoms with respect to one another. The 
directional nature of the protein data and the (almost constant) bond length between the main chain carbon atoms motivate 
the modelling using vMF mixtures. Further, structural modelling tasks such as generating random protein chain confor¬ 
mations, three-dimensional protein structure alignment, secondary structure assignment, and representing protein folding 
patterns using concise protein fragments require efficient encoding of protein data (Konagurthu et al., 2012, 2013; Collier 
et al., 2014). As part of our results, we demonstrate that vMF mixtures offer a better means of encoding and can potentially 
serve as strong candidate models to be used in such varied tasks. 

The dataset considered here is a collection of 8453 non-redundant experimentally determined protein structures from the 
publicly available ASTRAL SCOP-40 (version 1.75) database (Murzin et al., 1995). For each protein structure, the coordi¬ 
nates of the central carbon, Ca, of successive residues (amino acids) are considered. Protein coordinate data is transformed 
into directional data and each direction vector is characterized by = (co-latitude, longitude), where d e [0,180°] and 

(j) e [0,360°]. Note that these values have to be measured in a consistent, canonical manner. 

To compute corresponding to the point /j+i associ¬ 

ated to residue i -|- 1, we consider this point in the context of 
3 preceding points, forming a 4-mer comprising of the points 
Pi- 2 ,Pi-i,Pi, and ^■+1 . This 4-mer is orthogonally transformed 
into a canonical orientation (Fig. 24) in the following steps: 

- translate the 4-mer such that is at the origin. 

- rotate the resultant 4-mer so that i lies on the negative 
X-axis. 

- rotate further so that /)_2 lies in the XY plane such that 
the angle between the vector Pi _2 “ Pj-i and the positive 
Y-axis is acute. 



Fig. 24 Canonical orientation used to determine the directional 
data corresponding to protein coordinates. 


The transformation yields a canonical orientation for Pi+i with 
respect to its previous 3 coordinates. Using the transformed co¬ 
ordinates of Pi+i, the direction of is computed. We 

repeat this transformation for every successive set of 4-mers 
in the protein chain, over all possible source structures in our 
collection. The data collected in this way resulted in a total 
of ~ 1.3 million pairs for all the 8453 structures in the 

database. 

Protein data is an example where the number of mixture components are not known a priori. Hence, we use the method 
outlined in Section 8 to infer suitable mixture models. The original dataset comprises of 7 different categories of proteins. 
The proposed search method using MML (Halley’s) estimates infers a mixture containing 13 vMF components. Further, 
each protein category can be individually modelled using a mixture. As an example, for the “fi class” proteins which con¬ 
tains 1802 protein structures and 251,346 corresponding pairs, our search method terminates after inferring 11 vMF 
components. We compare the MML-based mixtures with those inferred by the standalone EM algorithm (Section 6.2) using 
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Other estimates. These values are presented in Table 14. We observe that the mixtures inferred using the MML estimates 
result in a message length lower than that obtained using the other estimates. 

Fig. 25(a) shows the empirical distribution of directional data of Ca coordinates belonging to {3 class. The {d,(j)) values 
with their corresponding frequencies are plotted in Fig. 25(a). Fig. 25(b) is a plot illustrating the 11-component vMF mixture 
density as inferred for this class of proteins. Notice that the two major modes in the figure correspond to commonly observed 
local secondary structural bias of residues towards, helices and strands of sheet. Also notice the empty region in the middle 
which corresponds to physically unrealizable directions in the local chain, excluded in the observed samples due to steric 
hindrance of the atoms in proteins. If we were to model such data using truncated distributions, the regions of zero probability 
will be modelled using an infinite code length. As an example, at (0,(^) = (100°, 200°), the truncated distribution would have 
zero probability and consequently an infinite code length. However, when the same point is explained using the 11-component 
mixture, it would have a probability of Pr = 3.36 x 10^*^ and a corresponding code length of —log 2 Pr = 38.11 bits. For 
protein data, it is possible to have such (rare exceptional) observations, due to reasons such as experimental error, noise, or 
the conformation of the protein itself. Hence, although the empirical distribution has distinct modes, it is better off modelled 
as a vMF mixture distribution, rather than by truncated distributions. 


2.5 



Fig. 25 Distribution of directional data of (3 class Ca atoms (a) Empirical distribution (b) Mixture density corresponding to the 11 inferred vMF 
components 


Compressibility of protein structures: The explanatory framework of MML allows for testing competing hypotheses. Re¬ 
cently, Konagurthu et al. (2012) developed a null model description of protein coordinate data as part of the statistical 
inference of protein secondary structure. A null model gives a baseline for transmitting the raw coordinate data. Each Ca 
atom is described using the distance and orientation with respect to the preceding Ca atoms. Because the distance between 
successive Ca atoms is highly constrained, compression can only be gained in describing the orientation of a Ca atom with 
respect to its previous one. 

Konagurthu et al. (2012) describe their null hypothesis by discretizing the surface of a 3D-sphere into chunks of equal 
areas (of e^, where e is the accuracy of measurement of coordinate data). This results in Attr^ cells distributed uniformly 
on the surface of the sphere of radius r (the distance between successive Ca coordinates). The cells are uniquely numbered. 
To encode C'a^ with respect to C'a, the location of C'a^ on the surface is identified and the corresponding cell Index is 
encoded. Using this description, the stated null model results in a message length expression Konagurthu et al. (2012) given 

by ^ 

Uniform Null =—log 2 = log2(4;r) — 2log 2 bits. (61) 

The null model of Konagurthu et al. (2012) assumes a uniform distribution of orientation angles on the surface of the 
sphere. However, this is a crude assumption and one can leverage the directional properties of protein coordinates to build 
an efficient null model. To this effect, we explore the use of vMF mixtures as null model descriptors for protein structures. 
Using vMF mixtures, we encode the co-latitude (0) and longitude {(j)) angles (described in Section 11.2). The message length 
expression to encode the orientation angles (using Equation 1) is then given by 

vMFNull = -log 2 (^Y.wjfj{x\0j)^ -21og2 


bits. 


(62) 
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where x corresponds to a unit vector described by {9,(j)) on the surface of the sphere. Equations (61) and (62) are two 
competing null models. These are used to encode the directional data corresponding to the 8453 protein structures in the 
ASTRAL SCOP-40 database. The results are shown in Table 15. 


Category 

Tanabe 

Sra 

Song 

MML (Halley) 

p 

5514800 

5518679 

5520073 

5513507 

All 

27818524 

27833704 

27839802 

27803427 


Null model 

Total message length (in bits) 

Bits per residue 

Uniform 

36119900 

27.437 

vMF 

32869700 

24.968 


Table 14 Message lengths (in bits) computed for the inferred protein mix¬ 
tures using various methods (‘All’ refers to all the 7 protein categories). 


Table 15 Comparison of the uniform and vMF null model encod¬ 
ing schemes. 


The per residue statistic is calculated by dividing the total message length by the sample size (the number of {6, (j)) pairs). 
This statistic shows that close to 2.5 bits can be saved (on average) if the protein data is encoded using the vMF null model. 
The vMF null model thus supercedes the naive model of encoding. This can potentially improve the accuracy of statistical 
inference that is central to the various protein modelling tasks briefly introduced above. 


12 Conclusion 

We presented a statistically robust approach for inferring mixtures of (i) multivariate Gaussian distributions, and (ii) von 
Mises-Fisher distributions for tZ-dimensional directional data. It is based on the general information-theoretic framework 
of minimum message length inference. This provides an objective tradeoff between the hypothesis complexity and the 
quality of fit to the data. An associated search procedure for an optimal mixture model of given data chooses the number of 
component distributions, M, by minimizing the total message length. We established the better performance of the proposed 
search algorithm by comparing with a popularly used search method (Figueiredo and Jain, 2002). We demonstrated the 
effectiveness of our approach through extensive experimentation and validation of our results. We also applied our method 
to real-world high dimensonal text data and to directional data that arises from protein chain conformations. The experimental 
results demonstrate that our proposed method fares better when compared with the current state of the art techniques. 
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13 Appendix 

13.1 Supporting derivations required for evaluating Kmn and Kmh 

For brevity, we represent Aj(ic),Aj(k'),Aj(k'), andAj(K') as A,A',A", and A'" respectively. Expressions to evaluate A,A', 
and A" are given in Equations (11), (15), and (17) respectively. We require A'" for its use in the remainder of the derivation 
and we provide its expression below: 
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Now we discuss the derivation of G'{k) and G"{k) that are required for computing the MML estimate Km (Equations 
(33) and (34)). Differentiating Equation 32, we have 
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Using Equations (11) and (15), we have 
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Using Equations (11), (15), (17), and (63) we have 
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where 


and 
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Equations (66) and (67) can be used to evaluate G'{x) and G"{x) which can then be used to approximate the MML 
estimates Xmn and Xmh (Equations (33) and (34) respectively). 
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13.2 Derivation of the weight estimates in MML mixture modelling 

M N M 

As per Equation (41), we have I{0,D) = log ^ Wjfj{xi',0j) + terms independent of wj 

2y=i i=i >=i 

To obtain the optimal weights under the constraint the above equation is optimized using the Lagrangian 

objective function defined below using some Lagrangian multiplier X. 

( M 

For some k 6 {1,A/}, the equation resulting from computing the partial derivative of L with respect to Wk and equating it to 
zero gives the optimal weight wj.. 
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We have 
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where and are the responsibility and effective membership terms given as per Equations (36) and (37) respectively. 
Substituting the above value in Equation (68), we have 
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1 ^nk 
2wk Wk 
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(69) 


There are M equations similar to Equation (69) for values of k E {1, M}. Adding all these equations together, we get 
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Substituting the above value of X in Equation (69), we get wj. 


nk+l 

A+f 


13.3 Derivation of the Kullback-Leibler (KL) distance between two von Mises-Fisher distributions 


The closed form expression to calculate the KL divergence between two vMF distributions is presented below. Let /(x) = 
CrfjjCi * and g(x) = be two von Mises-Fisher distributions with mean directions jUj ,/i 2 and concentration 

parameters (Ci, K^. The KL distance between any two distributions is given by 
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where Ey[.] is the expectation of the quantity [.] using the probability density function /. 
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Using the fact that Ey[x] = Ad{Ki)fli (Mardia et al., 1984; Fisher, 1993), we have the following expression: 
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13.4 An example showing the evolution of the mixture model 


We consider an example of a mixture with overlapping components and 
employ the proposed search method to determine the number of mixture 
components. Figueiredo and Jain (2002) considered the mixture shown in 
Fig. 26 which is described as follows: let e {1,4} be the weight, 

mean, and covariance matrix of the component respectively. Then, the 
mixture parameters are given as 

w\=W2 = w^ = 0.3, W4 = 0.1 
Ml = M2 = (-4,-4)^, Ms = (2,2)^, M4 = (-1,-6)^ 
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We generate a random sample of size 1000 (similar to Figueiredo and 
Jain (2002)) and infer mixtures using the proposed heuristic. Below are di¬ 
agrams which portray the splitting of the individual components. We show 
a selection of operations which result in improved mixtures. 



Fig. 26 Original mixture consisting of overlapping 
components. 



(a) P \: one-component mixture (b) Prior to splitting: initialization of means 


(c) P 2 '. after EM optimization 


Fig. 27 First iteration (a) Pi: the one-component mixture (/ = 27199 bits) (b) Red colour denotes the component being split. The dotted line is the 
direction of maximum variance. The black dots represent the initial means of the two-component sub-mixture (c) P 2 '. optimized mixture post-EM 
(/ = 26479 bits) results in an improvement. 



(a) Splitting a component in P 2 


(b) Optimized child sub-mixture (c) P 3 : post-EM improved mixture 


Fig. 28 Second iteration: splitting the (red) component in parent P 2 (/ = 26479 bits) results in an improvement (a) Initial means of the child 
components (b) Optimized child mixture (denoted by red dashed lines) along with the second component of P 2 (denoted by black dashes) (/ = 26467 
bits) (c) Py. stabilized mixture post-EM (7 = 26418 bits) results in an improvement. 
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(d) Splitting a component in P 3 (e) Optimized children (/ = 26451 bits) (f) P 4 : also an improved mixture (/ = 26266 bits) 

Fig. 29 Third iteration: splitting the (red) components in parent P 3 (/ = 26418 bits). We see that there are two splits which result in improved 
mixtures P'^ and P 4 . We select P 4 as the new parent as it has a lower message length compared to P 4 . 



(c) I = 26283 bits 


(d) / = 26279 bits 


Fig. 30 Fourth iteration: mixtures resulting from splitting each of the components in parent P 4 (/ = 26266 bits). We see that none of the 5-component 
mixtures result in further reduction of the message length. The search terminated and P 4 is considered the best mixture. 






















